Paper Trail (#3): A Check-in with Stanford’s Hazy Research Group

Explore the groundbreaking work of Stanford's Hazy Research group in advancing transformer-based neural networks and its implications for AI governance.

Eli Sherman, Ph.D.
Evi Fuelle
Ian Eisenberg

The AI Governance Research team here at Credo AI has grown to admire the technical contributions of the Hazy Research group, a deep learning lab at Stanford led by Professor Christopher Ré. We think their work is impactful and representative of cutting edge advances in transformer-based neural networks. We also love that they take the time to educate technical non-experts about the why, what, and how of their work through their research blog.

In this week’s Trail, we’ll summarize a few of Hazy Research’s recent efforts to push the frontier of transformer-based neural networks and discuss the relevance for AI governance and AI-specific regulation.


Plenty of ink has been spilled about the ‘frontier’ of AI models. Most discourse views the frontier in terms of capabilities: how performant and general are leading models like OpenAI’s GPT-4 or Anthropic’s Claude 2? What can they do out of the box? What can advanced prompt engineering or tool use enable them to do? What will the next state-of-the-art model (e.g., Deepmind’s Gemini or an anticipated GPT-5) look like? This emphasis on capabilities, combined with the discovery of “scaling laws,” has focused much of the field on compute and ever more expensive training runs.

The reality, however, is that capabilities research captures just one dimension of the technological frontier, and compute is just one aspect of AI alongside data and algorithms (the “AI triad”). Stanford’s Hazy Research group is engaging in novel methods research to push forward other critical dimensions of the frontier, like computational performance and latency, model size, and context length, all of which advance the practicality of AI. Their focus showcases the material role that algorithmic advances will play in how AI evolves, as demonstrated by their innovative applications in the AI-for-Science domain.

Key Takeaways

  • Context window length, the total amount of text (measured in tokens) an AI model can process at one time, is a key determinant of the usability and performance of state-of-the-art foundation models. There will be a constant desire for larger context windows.
  • Hazy Research is a lab at the forefront of AI algorithmic advances, pushing forward algorithmic efficiency and unlocking new applications.
  • As one example, Hazy Research has proposed a novel architecture, Hyena, for sequence modeling, which supports 1 million token context windows. This is orders of magnitude beyond commercially available AI systems.
  • Huge context windows open up new applications in science, including genomics research.

Takeaways for Policy Makers

  • The research community does not currently have adequate approaches to evaluate frontier model capabilities directly. It has been suggested that “compute” (the computational resources needed to train a model) can serve as a reasonable proxy for model capabilities.
  • Consequently, some AI governance approaches suggest imposing requirements on AI development that uses large amounts of compute, as a way to regulate “frontier” models (models whose high compute usage would predict best-in-class capabilities).
  • This approach may prove ineffective long-term, because algorithmic advances like those developed by the Hazy Research Lab will make it possible to create highly capable systems with less compute.
  • These developments predict a future where powerful AI models are truly ubiquitous, accessible to even the lowest-resourced organizations. More and more people will be able to run powerful models using hardware that is easily accessible, like personal laptops or smartphones.
  • If governance is going to connect with the actual capabilities and risks of powerful AI systems, we will need better, more direct evaluations of capabilities, to complement proxy measures like compute usage.

Paper Summaries

FlashAttention and FlashAttention-2

One of the most important factors in determining a transformer-based model’s usability is its context window. We’ve seen competitive pressure in the market to extend context windows, such that models like Claude 2 support context lengths of ~75,000 words, enabling them to process several documents at a time. However, increasing the context window size goes beyond improving usability: it also helps model performance. Performance generally increases with context window size because training on long sequences helps models get a stronger grasp of intra-sequence relationships. As an example, imagine trying to learn history by reading a textbook. It would be much easier to parse the flow of events and cause and effect if you could remember and reason over several paragraphs at a time rather than merely a few sentences. Unfortunately, because the “attention” component of transformer architectures scales quadratically in memory and runtime (increasing a model’s context length by a factor of 2 takes about 4x as long to train and requires about 4x as much RAM), training models with long contexts is prohibitively expensive.
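The quadratic scaling is easy to see in a minimal NumPy sketch of standard attention (our own illustration, not any lab’s production code): the softmax is computed over an N × N score matrix, so doubling the sequence length quadruples both the compute and the memory.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix,
    so memory and compute grow quadratically with context length N."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (N, N) matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

# Doubling the context length quadruples the score matrix:
for n in (1024, 2048):
    print(n, n * n)  # ~1M entries at N=1024, ~4M at N=2048
```

This is why naive implementations hit a wall well before the million-token regime discussed below.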

To tackle this issue, the Hazy group observed that existing implementations of attention, like the one in PyTorch, are intentionally generic: they’re designed to work on a wide variety of computer hardware rather than being tailored to training on state-of-the-art GPUs, leading to memory-usage inefficiencies. In their original FlashAttention paper, the Hazy group introduced a parallelization strategy for GPU data utilization that delivered 2.5x speedups for sequences up to 8k tokens with no performance dropoff, an algorithmic improvement subsequently adopted by virtually all the major AI labs.
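The core trick, processing keys and values in tiles while maintaining a running (“online”) softmax so the full N × N score matrix never exists, can be sketched in NumPy. This is our own simplification of the idea behind the CUDA kernels (the function name is ours); the real implementation fuses these steps on-chip.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=128):
    """Sketch of the FlashAttention idea: scan K/V in tiles, keeping a
    running softmax per query row. Output is exact attention, but working
    memory is O(N * block) instead of O(N^2)."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)             # (N, block) tile only
        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)          # rescale previous state
        p = np.exp(scores - new_max[:, None])
        out = out * scale[:, None] + p @ Vb
        row_sum = row_sum * scale + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]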

More recently, the lab released FlashAttention-2. This update further optimizes the way data are processed during training to reduce the number of non-matrix-multiplication (‘matmul’) operations; matmul operations are about 16x faster on modern GPUs than non-matmul operations. By cleverly assigning work to different areas of the GPU and making other hardware-specific optimizations, the authors achieve a 2x speedup over FlashAttention and reach 72% GPU utilization, compared to a maximum of 40% for FlashAttention and less than 20% for the original PyTorch implementation.

Hyena and HyenaDNA

Despite the advantages FlashAttention brings to the table, the attention operation still fundamentally scales quadratically with context window size. In practice, that means most large language models are limited to about 8k tokens of context, with some high-profile examples (developed by high-resource organizations) reaching 32k, 65k, or even 100k tokens. Still, the question stands: how can we get more? Say, 1M tokens?

The Hazy Research group tackled this challenge by developing an alternative to the attention algorithm. The group performed several experiments studying the power of various previously proposed alternative architectures for sequence modeling, as well as a novel architecture called Hyena. Hyena approximates the data-dependent component of attention (the part that takes past sequence information into account when predicting the next token) with much simpler convolutional layers and the fast Fourier transform. These alternative operators are much more computationally efficient than attention, scaling sub-quadratically. The authors found that models trained with this architecture are up to 100x faster at the 100k-token length and, moreover, match the training loss of similarly sized GPT-style models with much less training time.
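The convolution-plus-FFT primitive can be illustrated in a few lines of NumPy. This is our own sketch under simplifying assumptions (`fft_long_conv` is an assumed name): real Hyena layers also learn the filter implicitly and interleave such convolutions with elementwise gating, but the sub-quadratic cost comes from exactly this O(N log N) operation.

```python
import numpy as np

def fft_long_conv(u, h):
    """Long convolution of input u with filter h via FFT: O(N log N),
    versus the O(N^2) cost of pairwise attention over the same sequence."""
    n = u.shape[0]
    # Zero-pad to 2n so circular convolution equals linear convolution.
    U = np.fft.rfft(u, n=2 * n)
    H = np.fft.rfft(h, n=2 * n)
    # Multiply in frequency space, invert, keep the first n (causal) outputs.
    return np.fft.irfft(U * H, n=2 * n)[:n]
```

Because the cost grows as N log N rather than N², filters (and thus effective context) can span the entire million-token sequence.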

Hazy Research has followed up this work with a recent paper studying the applications of super-long context modeling. In HyenaDNA, the group uses the Hyena architecture to train a foundation model for genomics on the Human Reference Genome. This is among the first examples of foundation models being used to advance science, and it dramatically extends the state of the art in biological sequence modeling, from ~4k tokens to 1M tokens. They applied the trained model to 28 genomics tasks (e.g., various gene expression prediction benchmarks) and achieved state-of-the-art results on 23.

Why does this matter?

Smaller and faster is the name of the game

Efforts like FlashAttention and Hyena are making, and will continue to make, AI more accessible. Models like GPT-4 can only be run on specialized, hard-to-access, expensive hardware. FlashAttention lowers the cost to train a model from scratch or fine-tune a foundation model, meaning more people and organizations will be able to build models customized to their specific needs.

Advances like Hyena and Monarch Mixer (not covered in this blog) enable developers to train models that are smaller, with negligible impacts on performance. This could mean the difference between running a model in the cloud and running the model on a user’s personal device, which has all sorts of benefits, like lower cost, lower latency, and enhanced data privacy.

Compute multipliers push the frontier

While algorithmic efficiency allows you to achieve a certain level of performance with less compute, it also allows you to get more performance with the same compute. Anthropic’s CEO Dario Amodei calls these kinds of algorithms “compute multipliers,” as they let you do more with the same compute budget, which you can think of as effectively having “more” compute. Pushing the frontier of model capabilities depends on vast computational resources being used as efficiently as possible.

AI for Science

While most recent attention in generative AI has focused on consumer-facing applications, like ChatGPT and Bing, researchers have long anticipated the potential for AI to drive scientific discovery forward and enable rapid research progress. Work by Hazy Research, which combines existing neural network ideas with novel efficiency approaches and a novel application, is an example of the steady progress towards this vision. Advances like the ones highlighted here will become more frequent as the field matures and will accelerate advancements in downstream applications.

How is it relevant to AI Governance?

Scale and speed yield ubiquity

For organizations thinking about adopting AI, especially generative AI, optimizations like those studied by the Hazy Research group will enable better tailoring of models. This may help alleviate some common concerns we hear from customers, such as operational risks arising from poor performance. It also means that implementing AI systems into existing products will be dramatically easier. This brings new challenges to areas like procurement, where purchasing organizations will have to adopt increasingly bespoke vetting processes to meet the customization of the models and applications they’re being sold.

Ubiquity complicates enforcement

While Hazy Research is not embedding their models into specific applications or commercial use cases, their findings are being adapted and adopted by others in the AI value chain, both in research and in commercial deployments. Ubiquity of generative and general-purpose AI makes getting regulation ‘right’ challenging. Regulatory frameworks need to be able to properly assign responsibility for risk mitigations, and liability for realized harms, to the entities that have control over the harmful system. At present, only a small handful of players (e.g. OpenAI) control the most capable models, and access to these is generally closed. However, increasingly capable open-source models are being developed, which obviates the need for individual, low-resourced developers to amass the compute resources necessary for pre-training. Simultaneously, technical advancements in efficiency and context length dramatically lower the bar for deploying a generative AI-powered system. Taken together, these trends will eventually enable virtually anyone to develop and deploy a generative AI-powered system. Regulation targeted solely at organizations with large user bases or meeting specific revenue thresholds, as was the case in the EU’s Digital Services Act, may therefore inadequately address the possibility of grassroots risks. Policy makers need to carefully craft regulation that addresses the distribution and risks of AI today and anticipates the impending spread of AI development capabilities to the masses.

About Credo AI

Credo AI is on a mission to empower enterprises to responsibly build, adopt, procure and use AI at scale. Credo AI’s cutting-edge AI governance platform automates AI oversight and risk management, while enabling regulatory compliance to emerging global standards like the EU AI Act, NIST, and ISO. Credo AI keeps humans in control of AI, for better business and society. Founded in 2020, Credo AI has been recognized as a CBInsights AI 100, Technology Pioneer by the World Economic Forum, Fast Company’s Next Big Thing in Tech, and a top Intelligent App 40 by Madrona, Goldman Sachs, Microsoft and Pitchbook.

Ready to unleash the power of AI through governance? Reach out to us today!