AI Governance

Introducing Credo AI Labs: Where We're Building the Future of AI Governance

Credo AI Labs drives innovation in AI governance and development.

September 10, 2025
Author(s)
Ian Eisenberg

The pace of AI innovation is relentless. Every week brings new models, new capabilities, new possibilities. As individuals experiment with AI, they discover new uses and opportunities—a groundswell of innovation that is transforming enterprises. This Cambrian explosion requires AI governance to evolve so it can provide true "scalable oversight" and direct AI toward beneficial use. If we rely on the tools and approaches of the past, AI governance will inherit the "pacing problem" of previous technological eras—where policies, standards, and approaches to governance fail to keep up with technological innovation. We cannot let that happen.

We're thrilled to introduce Credo AI Labs—our innovation playground where we're experimenting with the boldest ideas in AI governance together with forward-thinking enterprises. This isn't a roadmap announcement or a coming-soon promise. Labs is a rapid prototyping team within Credo AI, and our first prototype is already changing how enterprises interpret model evaluations.

Why Labs?

Addressing the challenge of AI governance requires three fundamental capabilities:

  1. Using AI technology to power AI governance tools
  2. Deep integration into technical observability, evaluation, and control systems
  3. Synthesizing diverse information sources and constraints for human legibility and action

Credo AI's platform already delivers on many of these needs, but some challenges require a different approach—rapid experimentation with real-world validation. The most ambitious ideas need to be tested, broken, and rebuilt in partnership with the organizations actually facing these governance challenges.

That's where Labs comes in.

Labs is our space for the experiments that need to fail fast or scale fast. It's where we can prototype an AI agent that governs AI systems, test radical new approaches to benchmark synthesis, or explore new frameworks for governance policy and control. Most importantly, it's where design partners stress-test these ideas in real scenarios before they're ready for production.

Some Labs experiments will validate quickly and move into our core platform. Others will teach us valuable lessons and disappear. Many will evolve through multiple iterations based on partner feedback. Labs lets us explore the edges of what's possible without the constraints of production commitments.

"We created Labs because AI Governance requires the same accelerative innovation as AI development,”  says Ian Eisenberg, Head of AI Governance Research. "AI Governance is a nascent and evolving function - we are actively defining best practices in parallel with effective tooling. We have the opportunity to create one of the first AI-native disciplines in the enterprise - a necessity for effective and scalable oversight of AI."

How Labs Works: Innovation Through Partnership

Here's what makes Labs different: everything is an experiment, and nothing is guaranteed.

That might sound risky, but it's actually liberating. When you remove the pressure of permanent commitments, you create space for the kind of radical thinking that AI Governance desperately needs. Labs prototypes might evolve into core platform features, or they might teach us something valuable and then disappear. That's not a bug—it's the entire point.

Labs operates on three principles:

1. Build with our Ecosystem - Design partners work directly with us, shaping prototypes from day one.

2. Fail Fast, Learn Faster - If something isn't working, we kill it quickly and move on. No sunk cost fallacy, no feature bloat.

3. Safe to Explore - By default, nothing touches production environments. No customer data, no production dependencies, no risk.

Our First Lab: Model Trust Scores (And Why It's a Game-Changer)

To show what Labs can do, let's talk about our inaugural prototype: Model Trust Scores.

Here's the problem it solves: You're evaluating foundation models for your enterprise use case. You've got 60+ benchmarks across 30+ models. MMLU says one thing, HumanEval says another, TruthfulQA suggests something else entirely. Your spreadsheet is a rainbow of scores that tell you everything and nothing at the same time.

Sound familiar?

Benchmarks provide essential data, but raw scores aren't deployment decisions. Model Trust Scores helps you make informed decisions about the most important technology in your tech stack - the foundation model you choose. It takes all that benchmark chaos and translates it into context-aware insights across four dimensions:

  • Capability - Can it perform the task?
  • Safety - Can you trust it to follow safety guidelines?
  • Speed - Response latency
  • Affordability - Aggregated input/output token cost
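
As a purely illustrative sketch, a per-use-case trust profile along these four dimensions might be represented like this; the field names and example values are hypothetical, not the product's actual schema.

# Purely illustrative: one way a per-use-case trust profile could be represented.
# Field names and example values are hypothetical, not Model Trust Scores' schema.
from dataclasses import dataclass

@dataclass
class ModelTrustProfile:
    model: str
    use_case: str
    capability: float           # relevance-weighted benchmark score, 0-1
    safety: float               # relevance-weighted safety score, 0-1
    latency_ms: float           # speed: response latency
    cost_per_1k_tokens: float   # affordability: aggregated input/output token cost

profile = ModelTrustProfile(
    model="example-model",
    use_case="customer support triage",
    capability=0.74,
    safety=0.81,
    latency_ms=420.0,
    cost_per_1k_tokens=0.0042,
)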

The Technical Innovation: Context-Aware Relevance Scoring

For a deeper dive into the Model Trust Scores methodology, read our technical white paper.

At its core, Model Trust Scores is a continuously updated aggregation of third-party evaluations of foundation models. We bring together testing by organizations like Vals.AI, Artificial Analysis and Stanford’s CRFM into one dataset. We layer in “non-negotiable” enterprise requirements - from vendor security features to deployment support - that benchmarks alone don't capture.

The key advancement is our AI-powered relevance scoring system. For each use case × benchmark pair, we evaluate relevance on a 1-5 scale, separately for the capability and safety dimensions. A creative writing benchmark might score 5 for marketing content but 1 for financial fraud detection. We use an LLM-as-a-judge approach to analyze benchmark descriptions and systematically match them to use case requirements.

But we don't treat this scale linearly. We transform relevance scores using a quadratic function:

relevance_weight = ((relevance_score - 1) / 4)²

This transformation reflects a key perspective: highly relevant benchmarks are exponentially more valuable than generic ones. A relevance score of 5 becomes a weight of 1.0, while a score of 3 becomes just 0.25. This matches real-world AI development where specific evaluations become increasingly critical as you move toward production. 

We then create a relevance-weighted average of the normalized benchmark scores to arrive at capability and safety measures for a particular use case. For each industry, we average across the related use cases.
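
To make the math concrete, here's a minimal Python sketch of that weighting. The benchmark names, normalized scores, and relevance ratings below are made up; this illustrates the approach described above, not our production implementation.

def relevance_weight(relevance_score: int) -> float:
    """Quadratic transform of the 1-5 relevance rating: 1 -> 0.0, 3 -> 0.25, 5 -> 1.0."""
    return ((relevance_score - 1) / 4) ** 2

def weighted_dimension_score(normalized_scores: dict[str, float],
                             relevance: dict[str, int]) -> float:
    """Relevance-weighted average of normalized (0-1) benchmark scores for one use case."""
    weights = {bench: relevance_weight(r) for bench, r in relevance.items()}
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("No relevant benchmarks for this use case")
    return sum(normalized_scores[b] * w for b, w in weights.items()) / total_weight

# Hypothetical example: capability score for one model on one use case.
scores = {"GPQA": 0.71, "MATH-500": 0.88, "CreativeWritingBench": 0.64}
relevance = {"GPQA": 5, "MATH-500": 4, "CreativeWritingBench": 2}
print(round(weighted_dimension_score(scores, relevance), 3))  # ~0.766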

Handling Missing Data Through Statistical Imputation

The evaluation ecosystem has evolved significantly in recent years, fundamentally changing how we interpret missing benchmark data. Historically, when benchmarks were primarily reported by model providers themselves, missing scores often indicated selective disclosure—providers would typically omit unfavorable results while highlighting strong performance. This pattern made missing data inherently suspicious and suggested systematic bias in reported capabilities.

Today's third-party evaluation ecosystem transforms this dynamic. Independent organizations like Vals.AI, Stanford's CRFM, and others conduct systematic evaluations across models, creating a more objective assessment landscape. In this context, missing benchmarks typically reflect practical constraints: evaluator capacity, resource limitations, or simple timing delays as new models emerge faster than evaluation cycles can accommodate. The absence of a particular benchmark score for GPT-5 or Qwen 3 most likely indicates that evaluators haven't yet processed these models rather than deliberate omission of poor results.

This shift in the underlying data generation process informs our statistical approach. We employ k-nearest neighbors (KNN) imputation to estimate missing values based on observed correlations across the benchmark landscape. KNN proves particularly suitable for this application because it preserves local structure in the high-dimensional benchmark space—models with similar architectures and training approaches tend to cluster in performance patterns. When a model demonstrates strong performance on GPQA and MATH-500, the algorithm identifies similar models in this performance space and uses their scores on missing benchmarks to generate estimates.
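
As a rough illustration of the idea (not our production pipeline), here's KNN imputation over a models-by-benchmarks score matrix using scikit-learn's KNNImputer. The scores below are invented, and the neighbor count, distance weighting, and preprocessing are assumptions for the sketch.

# Illustrative KNN imputation over a models x benchmarks matrix of normalized scores.
# NaN marks benchmarks that evaluators have not yet run on a given model.
import numpy as np
from sklearn.impute import KNNImputer

scores = np.array([
    [0.82, 0.91, np.nan, 0.77],
    [0.79, 0.88, 0.64,   0.75],
    [0.41, 0.45, 0.32,   np.nan],
    [0.85, np.nan, 0.70, 0.80],
])

# Estimate missing values from the most similar models' observed scores.
imputer = KNNImputer(n_neighbors=2, weights="distance")
completed = imputer.fit_transform(scores)
print(np.round(completed, 2))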

Critically, we preserve transparency through confidence scoring. The system calculates confidence based on two factors:

  • The completeness of available benchmark data
  • The relevance of those benchmarks to the specific use case

A model might show impressive capability scores, but if those scores derive primarily from imputed values or low-relevance benchmarks, the confidence score reflects this uncertainty. Users always know whether they're making decisions based on solid evidence or preliminary estimates. For instance, a model showing 0.7 capability might have high confidence if based on highly relevant benchmarks, or low confidence if mostly imputed.
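
To illustrate the intuition, here's one hypothetical way to combine those two factors into a single 0-1 confidence value. The actual formula is documented in the technical white paper; the function and variable names below are ours, for illustration only.

def confidence(observed_weights: list[float], imputed_weights: list[float]) -> float:
    """Hypothetical confidence score from relevance weights (0-1) of each benchmark.

    observed_weights: relevance weights of benchmarks with real measured scores.
    imputed_weights:  relevance weights of benchmarks whose scores were imputed.
    High confidence requires evidence that is both mostly observed and highly relevant.
    """
    total = sum(observed_weights) + sum(imputed_weights)
    if total == 0:
        return 0.0
    completeness = sum(observed_weights) / total                       # observed share of evidence
    avg_relevance = total / (len(observed_weights) + len(imputed_weights))  # how relevant, on average
    return completeness * avg_relevance

# A 0.7 capability score backed by observed, relevant benchmarks...
print(confidence(observed_weights=[1.0, 0.56, 0.25], imputed_weights=[0.06]))  # ~0.45
# ...versus the same score backed mostly by imputed values.
print(confidence(observed_weights=[0.25], imputed_weights=[1.0, 0.56, 0.06]))  # ~0.06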

A Public Framework for Continuous Improvement

Model Trust Scores is designed as a public utility to enhance the evolving third-party evaluation ecosystem. Out of the box, we provide context-aware leaderboards for multiple industries and use cases, plus tools to calculate relevance for your specific needs. As the ecosystem evolves—with new benchmarks emerging and existing ones updating—the system incorporates these developments periodically. The relevance scoring methodology adapts to new evaluation paradigms, while the statistical framework handles varying levels of data completeness.

For organizations seeking deeper technical details—including complete mathematical formulations, imputation methodologies, and evidence strength calculations—our comprehensive technical white paper provides extensive documentation. There, readers will find discussions of normalization approaches, confidence interval calculations, and the theoretical foundations underlying our relevance scoring system.

The specific choices we’ve made here—relevance scoring, imputation, confidence metrics—evolved through iterative testing with early partners who validated results against their operational realities. Model Trust Scores demonstrates how governance innovations emerge from combining technical capability with practitioner expertise, but it represents just one application of this collaborative approach.

This Is Just the Beginning

Model Trust Scores showcases the kind of thinking Labs enables, but it's just our opening act. We're already prototyping:

  • Automated oversight workflows that catch risks before they become incidents
  • AI-assisted compliance that generates evidence and documentation automatically
  • Real-time drift detection that knows when your models start misbehaving
  • Governance co-pilots that guide teams through complex regulatory requirements

Some of these will ship. Some won't. That's the beauty of Labs—we can try things that seem impossible, and sometimes discover they're not.

Becoming a Design Partner: Join Us in Building the Future

Here's our invitation: Come build with us.

As a Labs design partner, you're not a beta tester or an early adopter. You're a co-creator. You get:

  • Access to future labs in development
  • First look at every new prototype we release 
  • Direct influence on our Labs development
  • Regular sessions with our research and product teams to shape the roadmap
  • The chance to define what AI governance becomes

We're not looking for everyone. We're looking for organizations that share our belief that governance can be different. Better. Maybe even exciting.

If you're tired of governance feeling like a tax on innovation...
If you believe AI can help govern AI...
If you're ready to reshape current workflows to scale with AI…

Welcome to Labs.

What's Next?

Starting today, Credo AI Labs is accepting applications for our design partner program. We're keeping the cohort small to ensure meaningful collaboration—this isn't a mass beta program where your feedback disappears into a void.

If you're selected, you'll get a direct line to our innovation team. You'll help decide which experiments we pursue next. And you'll be part of the group that defines how AI governance evolves.

The future of AI governance isn't going to be built in isolation. It's going to be built by practitioners, for practitioners, and it will require the input of an entire ecosystem.

We're building it in Labs. Want to join us?

[Apply to become a design partner →]

DISCLAIMER. The information we provide here is for informational purposes only and is not intended in any way to represent legal advice or a legal opinion that you can rely on. It is your sole responsibility to consult an attorney to resolve any legal issues related to this information.