Hybrid Associative Memory (HAM) leverages the complementary strengths of RNNs and attention. In HAM, the KV cache maintains long-range details by storing only those tokens that are unpredictable by the RNN. HAM achieves strong performance relative to the Transformer at a fraction of the cache size.
Sequence-mixing layers are the critical element in modern language models: they are the only component that lets each token relate to the rest of the sequence. The two dominant paradigms for sequence mixing in modern LLMs are self-attention and modern recurrent neural networks (RNNs) such as Mamba2/3 and Gated DeltaNet (GDN). These two layer types integrate the information contained in a sequence using orthogonal philosophies. Self-attention stores details about every time-point faithfully (the KV cache), like a scribe who notes every word in a conversation, which results in a cache that grows with the sequence length. The RNN, on the other hand, compresses the entire sequence into a fixed-size state, and thus summarizes the contents of the sequence.
The Transformer architecture, which uses self-attention, delivers strong performance, but at the cost of large memory and compute requirements: during training, the computational cost of attention grows quadratically with sequence length, and during inference the KV cache grows linearly, rapidly becoming memory-bandwidth bound.
This cost becomes ever more salient in long-context tasks which require deep and complex reasoning. The computational and memory bottleneck of the KV cache is one of the most critical issues in training and deploying modern LLMs.
On the other hand, RNNs are efficient with respect to both memory and computation—the memory required doesn’t depend on the sequence length and the computational cost is linear instead of quadratic. However, because they must compress the full context into a fixed-size state, performance inevitably degrades, especially on long-context tasks which require recalling precise details. The two methods have complementary strengths: one prioritizes precise recall, while the other prioritizes efficient summarization, which can be beneficial for generalization.
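To make this trade-off concrete, here is a back-of-the-envelope memory comparison. All layer counts, head counts, and dimensions below are illustrative, not taken from any model in this post:

```python
# Illustrative memory accounting: growing KV cache vs. fixed-size RNN state.
# All shapes are hypothetical; fp16 storage assumed (2 bytes per element).

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    # Keys AND values are retained for every past position in every layer.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per

def rnn_state_bytes(n_layers, n_heads, head_dim, state_dim, bytes_per=2):
    # One fixed-size matrix state per head per layer, independent of length.
    return n_layers * n_heads * head_dim * state_dim * bytes_per

for seq_len in (4_096, 16_384, 131_072):
    kv = kv_cache_bytes(seq_len, n_layers=24, n_kv_heads=8, head_dim=64)
    rnn = rnn_state_bytes(n_layers=24, n_heads=8, head_dim=64, state_dim=128)
    print(f"{seq_len:>7} tokens: KV {kv / 2**20:8.1f} MiB | RNN state {rnn / 2**20:5.1f} MiB")
```

The KV cache scales linearly with the sequence length while the RNN state stays constant; at long contexts the gap spans orders of magnitude.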
The large computational and memory cost of the KV cache has led to a significant body of work on hybrid architectures which combine these methods. The predominant class of hybrids interleaves layers of self-attention with RNNs, with a focus on reducing the computational burden of the self-attention layers, which Zyphra helped pioneer with its Zamba and Zamba2 suite of models.
Another class of hybrids combines an RNN and attention within the same layer. However, neither approach exploits the complementary strengths of the two sequence-mixing methods; instead, both naively run the two in parallel.
In this work, we propose a novel framework, Hybrid Associative Memory (HAM), which explicitly uses the RNN state and the KV cache in a complementary way. Specifically, we combine RNNs and self-attention so that the RNN summarizes the contents of a sequence, while the KV cache stores only the parts of the sequence that the RNN cannot predict. Effectively, the KV cache acts as a notebook or cheat-sheet in which the model stores specific, precise details, whereas the RNN captures the general meaning of the sequence.
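The core idea can be sketched in a few lines. Here a linear associative state stands in for the RNN, and the per-token prediction error plays the role of the routing score; the function name, the error metric, and the fixed cache fraction are all illustrative stand-ins, not HAM's actual design:

```python
import numpy as np

def route_to_cache(keys, values, state, cache_fraction=0.5):
    """Route the tokens the RNN-like state predicts worst to the KV cache.

    keys   : (T, d_k) per-token keys
    values : (T, d_v) per-token values
    state  : (d_v, d_k) associative memory; its prediction for token t
             is state @ keys[t]
    """
    # Surprise = how badly the state reconstructs each token's value.
    errors = np.linalg.norm(values - keys @ state.T, axis=-1)
    n_cache = max(1, int(cache_fraction * len(keys)))
    # The most surprising tokens are stored verbatim in the KV cache;
    # the rest are left to the RNN's compressed summary.
    cached_idx = np.argsort(errors)[-n_cache:]
    return set(cached_idx.tolist()), errors
```

Tokens the state already predicts well contribute nothing new to the cache, so only the unpredictable ones consume cache memory.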

For a query $q_t$ and cached keys and values $\{(k_i, v_i)\}_{i \le t}$, causal self-attention computes the output

$$y_t = \sum_{i=1}^{t} \frac{\exp\left(q_t^\top k_i / \sqrt{d}\right)}{\sum_{j=1}^{t} \exp\left(q_t^\top k_j / \sqrt{d}\right)}\, v_i.$$

This expression makes the growing cost of self-attention explicit: to compute the current output, the model must retain keys and values for all past positions.

An RNN such as Gated DeltaNet instead maintains a fixed-size matrix state $S_t$, updated by the gated delta rule

$$S_t = \alpha_t S_{t-1}\left(I - \beta_t k_t k_t^\top\right) + \beta_t v_t k_t^\top,$$

and the output is given by

$$y_t = S_t q_t.$$

This update rule is derived from an optimization problem: at each step, the state takes a gradient step on the online prediction error $\tfrac{1}{2}\|S k_t - v_t\|^2$, with $\beta_t$ acting as the step size and $\alpha_t$ as a forget gate.
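A minimal reference implementation of this update, one token at a time (real implementations use chunked, fused kernels; the plain-NumPy form below is for exposition only):

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One step of the gated delta rule (reference version).

    S     : (d_v, d_k) associative state
    k, q  : (d_k,) key and query; k is assumed unit-norm
    v     : (d_v,) value
    alpha : forget gate in [0, 1] (decays old associations)
    beta  : write strength in [0, 1] (step size of the online update)
    """
    # Erase the state's current association for k, then write the new one:
    #   S_new = alpha * S (I - beta k k^T) + beta v k^T
    S_new = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    y = S_new @ q  # the output is a read of the updated state
    return S_new, y
```

With `alpha = beta = 1` and a unit-norm key, a single step stores the association exactly: reading the state with `k` recovers `v`.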
We illustrate the routing of surprising tokens to the KV cache using an example from the “Needle in a Haystack” (NIAH) task, where a needle is hidden within a longer sequence. As we can see, the routing metric correctly spikes at the needle, and those tokens are routed to the KV cache. Natural language is messier and more complex, and we study the routing behavior on such sequences in the technical paper.
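A toy version of this experiment shows why a surprise score spikes at the needle. Here the "RNN" is just an exponential moving average and the needle is an out-of-distribution vector; both choices are illustrative stand-ins for HAM's actual router:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 64, 8
needle_pos = 40

# Haystack: small perturbations of a single repeated "filler" vector.
filler = np.ones(d)
tokens = filler + 0.01 * rng.normal(size=(T, d))
tokens[needle_pos] = filler + 10.0  # the needle is unlike its context

# Surprise = prediction error of a running-mean predictor.
surprise = np.empty(T)
mean = np.zeros(d)
for t in range(T):
    surprise[t] = np.linalg.norm(tokens[t] - mean)
    mean = 0.9 * mean + 0.1 * tokens[t]  # exponential moving average

print(int(np.argmax(surprise)))  # → 40: the spike lands on the needle
```

Every filler token is well predicted by the running summary, so its surprise stays near zero; only the needle produces a large error and would be routed to the cache.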

We compare HAM against parameter-matched baselines at the 800M scale: a standard Transformer, a pure Gated DeltaNet (GDN), and a stacked hybrid that interleaves GDN with global self-attention layers (GDN-GSA). All HAM variants use only 50% of the KV cache relative to the Transformer and are memory- and compute-matched to GDN-GSA (but not to GDN or the Transformer). Models are trained on 50B tokens from the Long Data Collections dataset with a context length of 16,384. Note that the Transformer baseline has a larger KV-cache dimension than GDN-GSA and HAM.
On the standard language-modeling and commonsense suite, the learned-router HAM variant achieves the best overall average score in the table: 49.1, versus 48.8 for the Transformer baseline and 48.4 for the GDN-GSA hybrid. The learned router also posts the strongest ARC-e score at 50.8, while the EDA variant reaches the best BoolQ score at 59.7 and the best WikiText perplexity at 19.49. HAM is therefore fully competitive on ordinary next-token modeling and zero-shot reasoning tasks.
The long-context results are more nuanced: HAM is not uniformly best across all RULER tasks. However, at a matched 50% cache budget, HAM is especially strong on the retrieval settings where interference might matter most: the multi-key (MK), multi-query (MQ), and multi-value (MV) tasks. The learned-router model reaches 67/45/19 on MK2 at 4k/8k/16k context, substantially above the Transformer's 28/37/14 and far above the GDN-GSA hybrid's 15/12/2.6. The EDA variant also recovers very strong single-2 performance. In the paper, we further show how these results vary as we tune the fraction of tokens routed to the KV cache.
Sequence-mixing layers are the critical element in modern language models. They are the only component in the model that lets each token relate to the rest of the sequence. The two dominant paradigms for sequence-mixing layers in modern LLMs are self-attention and modern recurrent neural networks (RNNs) such as Mamba2/3 and GatedDeltaNet (GDN). These two layer types integrate the information contained in a sequence in their internal state using orthogonal philosophies: self-attention stores details about every time-point in the sequence faithfully (the KV cache), like a scribe who notes every word in a conversation, and this results in a KV cache that grows with the sequence length. The RNN on the other hand, compresses the entire sequence into a fixed-size state, and thus, summarizes the contents of the sequence.
The Transformer architecture, which uses self-attention, shows strong performance, but this performance comes at the expense of a large memory and computational cost. During training, the computational cost of attention grows quadratically, and during inference the KV cache size grows linearly with sequence length, rapidly becoming memory-bandwidth bound.
This cost becomes ever more salient in long-context tasks which require deep and complex reasoning. The computational and memory bottleneck of the KV cache is one of the most critical issues in training and deploying modern LLMs.
On the other hand, RNNs are efficient with respect to both memory and computation—the memory required doesn’t depend on the sequence length and the computational cost is linear instead of quadratic. However, because they must compress the full context into a fixed-size state, performance inevitably degrades, especially on long-context tasks which require recalling precise details. The two methods have complementary strengths: one prioritizes precise recall, while the other prioritizes efficient summarization, which can be beneficial for generalization.
The large computational and memory cost of the KV cache has led to a significant body of work on hybrid architectures which combine these methods. The predominant class of hybrids interleaves layers of self-attention with RNNs, with a focus on reducing the computational burden of the self-attention layers, which Zyphra helped pioneer with its Zamba and Zamba2 suite of models.
Another class of hybrids combines an RNN and Transformer in the same layer. However, both these approaches do not exploit the complementary strengths of these two sequence-mixing methods, rather they naively do both in parallel.
In this work, we propose a novel framework, which we call Hybrid Associative Memories (HAM), which explicitly uses the RNN state and the KV cache in a complementary way. Specifically, we combine RNNs and self-attention so that the RNN summarizes the contents of a sequence, while the KV cache only stores the parts of the sequence that the RNN cannot predict. Effectively, we can think of the KV cache as a notebook or cheat-sheet in which the model selects and stores specific, precise details whereas the RNN can capture the general meaning of the sequence.
This expression makes the growing cost of self-attention explicit: to compute the current output, the model must retain keys and values for all past positions.
and the output is given by
This update rule is derived from an optimization problem where the state update is optimized to reduce an online prediction error.
We illustrate the routing of the surprising tokens to the KV cache using an example from the “Needle in a Haystack” (NIAH) task where a needle is hidden in between a longer sequence. As we can see the routing metric correctly spikes at the needle and these tokens are routed to the KV cache. Natural language is messier and more complex and we study the behavior of the routing for these sequences in the technical paper.


We compare HAM against parameter-matched baselines at the 800M scale: a standard Transformer, a pure Gated DeltaNet (GDN), and a stacked hybrid that interleaves GDN with global self-attention layers (GDN-GSA). All HAM variants use only 50% of the KV cache relative to the Transformer and are memory- and compute-matched to the GDN-GSA (but not the GDN or Transformer). Models are trained on 50B tokens from the Long Data Collections dataset with a context length of 16,384. We note that the Transformer baseline in our test has a larger dimension for the KV cache compared to GDN-GSA and HAM.

On the standard language-modeling and commonsense suite, the learned-router HAM variants achieve the best overall average score reported in the table: 49.1, versus 48.8 for the Transformer baseline and 48.4 for the GDN-GSA hybrid. The learned router posts the strongest ARC-e score in the table at 50.8, while the EDA variant reaches the best BoolQ score at 59.7 and the best WikiText perplexity at 19.49. This shows that HAM is competitive on ordinary next-token modeling and zero-shot reasoning tasks.
The long-context results are more nuanced: HAM is not uniformly best at all RULER tasks; however, at a matched 50% cache budget, HAM is especially strong on the retrieval settings where interference might matter most: the multikey (MK), multiquery (MQ) and multivalue (MV) tasks. The learned-router model reaches 67/45/19 on MK2 at 4k/8k/16k, substantially above the Transformer’s 28/37/14 and far above the GDN-GSA hybrid’s 15/12/2.6. The EDA variant also recovers very strong single-2 performance. In the paper, we also show how these results vary as we tune the fraction of tokens routed to the KV cache.
Sequence-mixing layers are the critical element in modern language models. They are the only component in the model that lets each token relate to the rest of the sequence. The two dominant paradigms for sequence-mixing layers in modern LLMs are self-attention and modern recurrent neural networks (RNNs) such as Mamba2/3 and GatedDeltaNet (GDN). These two layer types integrate the information contained in a sequence in their internal state using orthogonal philosophies: self-attention stores details about every time-point in the sequence faithfully (the KV cache), like a scribe who notes every word in a conversation, and this results in a KV cache that grows with the sequence length. The RNN on the other hand, compresses the entire sequence into a fixed-size state, and thus, summarizes the contents of the sequence.
The Transformer architecture, which uses self-attention, shows strong performance, but this performance comes at the expense of a large memory and computational cost. During training, the computational cost of attention grows quadratically, and during inference the KV cache size grows linearly with sequence length, rapidly becoming memory-bandwidth bound.
This cost becomes ever more salient in long-context tasks which require deep and complex reasoning. The computational and memory bottleneck of the KV cache is one of the most critical issues in training and deploying modern LLMs.
On the other hand, RNNs are efficient with respect to both memory and computation—the memory required doesn’t depend on the sequence length and the computational cost is linear instead of quadratic. However, because they must compress the full context into a fixed-size state, performance inevitably degrades, especially on long-context tasks which require recalling precise details. The two methods have complementary strengths: one prioritizes precise recall, while the other prioritizes efficient summarization, which can be beneficial for generalization.
The large computational and memory cost of the KV cache has led to a significant body of work on hybrid architectures which combine these methods. The predominant class of hybrids interleaves layers of self-attention with RNNs, with a focus on reducing the computational burden of the self-attention layers, which Zyphra helped pioneer with its Zamba and Zamba2 suite of models.
Another class of hybrids combines an RNN and Transformer in the same layer. However, both these approaches do not exploit the complementary strengths of these two sequence-mixing methods, rather they naively do both in parallel.
In this work, we propose a novel framework, which we call Hybrid Associative Memories (HAM), which explicitly uses the RNN state and the KV cache in a complementary way. Specifically, we combine RNNs and self-attention so that the RNN summarizes the contents of a sequence, while the KV cache only stores the parts of the sequence that the RNN cannot predict. Effectively, we can think of the KV cache as a notebook or cheat-sheet in which the model selects and stores specific, precise details whereas the RNN can capture the general meaning of the sequence.
This expression makes the growing cost of self-attention explicit: to compute the current output, the model must retain keys and values for all past positions.
and the output is given by
This update rule is derived from an optimization problem where the state update is optimized to reduce an online prediction error.
We illustrate the routing of the surprising tokens to the KV cache using an example from the “Needle in a Haystack” (NIAH) task where a needle is hidden in between a longer sequence. As we can see the routing metric correctly spikes at the needle and these tokens are routed to the KV cache. Natural language is messier and more complex and we study the behavior of the routing for these sequences in the technical paper.



We compare HAM against parameter-matched baselines at the 800M scale: a standard Transformer, a pure Gated DeltaNet (GDN), and a stacked hybrid that interleaves GDN with global self-attention layers (GDN-GSA). All HAM variants use only 50% of the KV cache relative to the Transformer and are memory- and compute-matched to the GDN-GSA (but not the GDN or Transformer). Models are trained on 50B tokens from the Long Data Collections dataset with a context length of 16,384. We note that the Transformer baseline in our test has a larger dimension for the KV cache compared to GDN-GSA and HAM.
On the standard language-modeling and commonsense suite, the learned-router HAM variants achieve the best overall average score reported in the table: 49.1, versus 48.8 for the Transformer baseline and 48.4 for the GDN-GSA hybrid. The learned router posts the strongest ARC-e score in the table at 50.8, while the EDA variant reaches the best BoolQ score at 59.7 and the best WikiText perplexity at 19.49. This shows that HAM is competitive on ordinary next-token modeling and zero-shot reasoning tasks.
The long-context results are more nuanced: HAM is not uniformly best at all RULER tasks; however, at a matched 50% cache budget, HAM is especially strong on the retrieval settings where interference might matter most: the multikey (MK), multiquery (MQ) and multivalue (MV) tasks. The learned-router model reaches 67/45/19 on MK2 at 4k/8k/16k, substantially above the Transformer’s 28/37/14 and far above the GDN-GSA hybrid’s 15/12/2.6. The EDA variant also recovers very strong single-2 performance. In the paper, we also show how these results vary as we tune the fraction of tokens routed to the KV cache.
Sequence-mixing layers are the critical element in modern language models. They are the only component in the model that lets each token relate to the rest of the sequence. The two dominant paradigms for sequence-mixing layers in modern LLMs are self-attention and modern recurrent neural networks (RNNs) such as Mamba2/3 and GatedDeltaNet (GDN). These two layer types integrate the information contained in a sequence in their internal state using orthogonal philosophies: self-attention stores details about every time-point in the sequence faithfully (the KV cache), like a scribe who notes every word in a conversation, and this results in a KV cache that grows with the sequence length. The RNN on the other hand, compresses the entire sequence into a fixed-size state, and thus, summarizes the contents of the sequence.
The Transformer architecture, which uses self-attention, shows strong performance, but this performance comes at the expense of a large memory and computational cost. During training, the computational cost of attention grows quadratically, and during inference the KV cache size grows linearly with sequence length, rapidly becoming memory-bandwidth bound.
This cost becomes ever more salient in long-context tasks which require deep and complex reasoning. The computational and memory bottleneck of the KV cache is one of the most critical issues in training and deploying modern LLMs.
On the other hand, RNNs are efficient with respect to both memory and computation—the memory required doesn’t depend on the sequence length and the computational cost is linear instead of quadratic. However, because they must compress the full context into a fixed-size state, performance inevitably degrades, especially on long-context tasks which require recalling precise details. The two methods have complementary strengths: one prioritizes precise recall, while the other prioritizes efficient summarization, which can be beneficial for generalization.
The large computational and memory cost of the KV cache has led to a significant body of work on hybrid architectures which combine these methods. The predominant class of hybrids interleaves layers of self-attention with RNNs, with a focus on reducing the computational burden of the self-attention layers, which Zyphra helped pioneer with its Zamba and Zamba2 suite of models.
Another class of hybrids combines an RNN and Transformer in the same layer. However, both these approaches do not exploit the complementary strengths of these two sequence-mixing methods, rather they naively do both in parallel.
In this work, we propose a novel framework, which we call Hybrid Associative Memories (HAM), which explicitly uses the RNN state and the KV cache in a complementary way. Specifically, we combine RNNs and self-attention so that the RNN summarizes the contents of a sequence, while the KV cache only stores the parts of the sequence that the RNN cannot predict. Effectively, we can think of the KV cache as a notebook or cheat-sheet in which the model selects and stores specific, precise details whereas the RNN can capture the general meaning of the sequence.

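For reference, the causal self-attention readout that the next sentence refers to can be written in its standard form (which may differ cosmetically from the paper's notation):

```latex
y_t \;=\; \sum_{i=1}^{t}
\frac{\exp\!\left(q_t^\top k_i / \sqrt{d}\right)}
     {\sum_{j=1}^{t} \exp\!\left(q_t^\top k_j / \sqrt{d}\right)}\, v_i ,
```

where $q_t$, $k_i$, and $v_i$ are the query, key, and value vectors and $d$ is the head dimension. Computing $y_t$ requires every past $(k_i, v_i)$ pair, which is exactly the KV cache.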
This expression makes the growing cost of self-attention explicit: to compute the current output, the model must retain keys and values for all past positions.
The RNN, by contrast, updates a fixed-size state recurrently at each step, and the output is read out from that state. This update rule is derived from an optimization problem in which the state update is chosen to reduce an online prediction error.
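The update rule referenced above is presumably a (gated) delta-rule recurrence; a standard form of the Gated DeltaNet update and readout (our reconstruction, not necessarily the paper's exact parameterization) is:

```latex
S_t \;=\; \alpha_t\, S_{t-1}\!\left(I - \beta_t\, k_t k_t^\top\right)
\;+\; \beta_t\, v_t k_t^\top ,
\qquad
y_t \;=\; S_t\, q_t ,
```

which is a decayed gradient step on the online prediction loss $\mathcal{L}_t(S) = \tfrac{1}{2}\lVert S k_t - v_t \rVert^2$ with step size $\beta_t$ and gate $\alpha_t$; this is the sense in which the state update reduces an online prediction error.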
We illustrate the routing of surprising tokens to the KV cache using an example from the "Needle in a Haystack" (NIAH) task, where a needle is hidden within a longer sequence. As we can see, the routing metric correctly spikes at the needle, and these tokens are routed to the KV cache. Natural language is messier and more complex; we study the routing behavior on such sequences in the technical paper.
We compare HAM against parameter-matched baselines at the 800M scale: a standard Transformer, a pure Gated DeltaNet (GDN), and a stacked hybrid that interleaves GDN with global self-attention layers (GDN-GSA). All HAM variants use only 50% of the KV cache relative to the Transformer and are memory- and compute-matched to the GDN-GSA (but not to the GDN or the Transformer). Models are trained on 50B tokens from the Long Data Collections dataset with a context length of 16,384. Note that the Transformer baseline has a larger KV-cache dimension than GDN-GSA and HAM.
On the standard language-modeling and commonsense suite, the learned-router HAM variant achieves the best overall average score in the table: 49.1, versus 48.8 for the Transformer baseline and 48.4 for the GDN-GSA hybrid. The learned router posts the strongest ARC-e score in the table at 50.8, while the EDA variant reaches the best BoolQ score at 59.7 and the best WikiText perplexity at 19.49. This shows that HAM is competitive on ordinary next-token modeling and zero-shot reasoning tasks.
The long-context results are more nuanced: HAM is not uniformly best across RULER tasks. However, at a matched 50% cache budget, HAM is especially strong on the retrieval settings where interference should matter most: the multi-key (MK), multi-query (MQ), and multi-value (MV) tasks. The learned-router model reaches 67/45/19 on MK2 at 4k/8k/16k, substantially above the Transformer's 28/37/14 and far above the GDN-GSA hybrid's 15/12/2.6. The EDA variant also recovers very strong Single-2 performance. In the paper, we also show how these results vary as we tune the fraction of tokens routed to the KV cache.
In Figure 3, we see that as the KV fraction increases, loss decreases in a smooth, predictable way, instead of a sharp cliff; long-context eval scores show similar trends up to a point, but we observe a slight decrease for larger KV cache sizes, where interference may be greater.
Figure 7 shows that the global KV cache target is reached rapidly, and that different layers have very different KV cache usage under the global target. HAM offers the possibility of probing or setting the KV cache capacity of different layers in a continuous and precise manner.
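A minimal sketch of how one might probe this per-layer usage from logged routing decisions; the helper `kv_usage_by_layer` and the mask format are illustrative assumptions, not the paper's API.

```python
def kv_usage_by_layer(routing_masks):
    """Per-layer KV cache usage from boolean routing decisions.

    routing_masks: dict mapping layer index -> list of booleans,
                   True where a token was routed to that layer's KV cache.
    Returns the fraction of tokens cached by each layer.
    """
    return {layer: sum(mask) / len(mask) for layer, mask in routing_masks.items()}

# Under one global target, individual layers can settle on very different usage
masks = {0: [True, False, False, False], 1: [True, True, True, False]}
usage = kv_usage_by_layer(masks)
# usage == {0: 0.25, 1: 0.75}
```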
This level of fine-grained, continuous control over KV cache size has been difficult to achieve in prior architectures. Stacked hybrids can only reduce KV usage by removing attention layers—a discrete, coarse-grained choice made before training. Hybrid-head models can use sliding-window attention, but these are static decisions. Post-hoc KV cache compression methods (SnapKV, PyramidKV, Ada-KV, etc.) provide budget control at inference time, but the model was trained with a full cache and has no complementary memory to absorb evicted tokens. In HAM, the RNN explicitly captures compressible context, ensuring that eviction and compression are coordinated and learned end-to-end from the start of training.

HAM introduces a principled way to combine recurrence and attention by exploiting their complementary strengths. Rather than treating the KV cache and RNN as independent sequence-mixers, HAM enables them to work in a complementary fashion: the RNN summarizes the predictable part of the sequence, and the KV cache stores the parts that are surprising.
This approach has many appealing properties:
Competitive performance. At 800M parameters with only 50% KV cache, HAM matches or exceeds full-cache Transformers and layerwise hybrids on standard benchmarks and many long-context tasks.
Fine-grained control of the KV cache budget. The KV cache budget is a continuous, user-specified parameter with a smooth and predictable trade-off with performance—enabling practitioners to select the right operating point for their latency and memory constraints.
Grounding in complementary memory systems theory. HAM's design aligns with Complementary Learning Systems theory from neuroscience, in which distinct subsystems (the hippocampus and cortex) handle fast episodic recall and slower, abstract integration of experience.
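As a toy illustration of that continuous budget control, a target KV fraction can be mapped to a routing threshold by taking a quantile of the surprise scores. The helper below is hypothetical; the trained model's actual mechanism may differ.

```python
def threshold_for_budget(surprise_scores, kv_fraction):
    """Pick a surprise threshold so that roughly `kv_fraction` of the
    tokens exceed it and are routed to the KV cache."""
    ranked = sorted(surprise_scores, reverse=True)
    k = max(1, int(kv_fraction * len(ranked)))
    return ranked[k - 1]  # the k-th highest score becomes the cutoff

scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 1.0]
tau = threshold_for_budget(scores, kv_fraction=0.3)
cached = [s >= tau for s in scores]
# tau == 0.8, and exactly 3 of the 10 tokens clear the threshold
```

Because the cutoff is a continuous function of the budget, the operating point can be dialed smoothly rather than chosen from a few discrete architectures.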
Looking ahead, several directions are promising. The relatively smooth trade-off between KV cache usage and performance suggests that the threshold could be varied at test time, adapting the KV cache budget within a sequence or across tasks. The KV cache growth rate could also be scheduled to achieve sublinear growth on highly structured sequences. Finally, scaling HAM to larger model sizes and longer contexts will be important for understanding whether these advantages become more pronounced in truly long-context scenarios.
HAM introduces a principled way to combine recurrence and attention by exploiting their complementary strengths. Rather than treating the KV cache and RNN as independent sequence-mixers, HAM enables them to work in a complementary fashion: the RNN summarizes the predictable part of the sequence, and the KV cache stores the parts that are surprising.
This approach has many appealing properties:
Competitive performance. At 800M parameters with only 50% KV cache, HAM matches or exceeds full-cache Transformers and layerwise hybrids on standard benchmarks and many long-context tasks.
Fine-grained control of the KV cache budget. The KV cache budget is a continuous, user-specified parameter with a smooth and predictable trade-off with performance—enabling practitioners to select the right operating point for their latency and memory constraints.
Leveraging complementary memory systems theory: HAM’s design aligns with Complementary Learning Systems theory in neuroscience, where distinct subsystems (hippocampus and cortex) handle fast episodic recall and slower abstract integration of experience.
Looking ahead, several directions are promising. The relatively smooth trade-off between KV cache usage and performance suggests that the threshold could be varied during test time, adapting the KV cache budget within a sequence or across tasks. The KV cache growth rate could also be scheduled to achieve sublinear growth for highly structured sequences. Finally, scaling HAM to larger model sizes and longer contexts will be important to understand whether these advantages will be more pronounced on truly long-context scenarios.
Sequence-mixing layers are the critical element in modern language models. They are the only component in the model that lets each token relate to the rest of the sequence. The two dominant paradigms for sequence-mixing layers in modern LLMs are self-attention and modern recurrent neural networks (RNNs) such as Mamba2/3 and GatedDeltaNet (GDN). These two layer types integrate the information contained in a sequence in their internal state using orthogonal philosophies: self-attention stores details about every time-point in the sequence faithfully (the KV cache), like a scribe who notes every word in a conversation, and this results in a KV cache that grows with the sequence length. The RNN on the other hand, compresses the entire sequence into a fixed-size state, and thus, summarizes the contents of the sequence.
The Transformer architecture, which uses self-attention, shows strong performance, but this performance comes at the expense of a large memory and computational cost. During training, the computational cost of attention grows quadratically, and during inference the KV cache size grows linearly with sequence length, rapidly becoming memory-bandwidth bound.
This cost becomes ever more salient in long-context tasks which require deep and complex reasoning. The computational and memory bottleneck of the KV cache is one of the most critical issues in training and deploying modern LLMs.
On the other hand, RNNs are efficient with respect to both memory and computation—the memory required doesn’t depend on the sequence length and the computational cost is linear instead of quadratic. However, because they must compress the full context into a fixed-size state, performance inevitably degrades, especially on long-context tasks which require recalling precise details. The two methods have complementary strengths: one prioritizes precise recall, while the other prioritizes efficient summarization, which can be beneficial for generalization.
The large computational and memory cost of the KV cache has led to a significant body of work on hybrid architectures which combine these methods. The predominant class of hybrids interleaves layers of self-attention with RNNs, with a focus on reducing the computational burden of the self-attention layers, which Zyphra helped pioneer with its Zamba and Zamba2 suite of models.
Another class of hybrids combines an RNN and Transformer in the same layer. However, both these approaches do not exploit the complementary strengths of these two sequence-mixing methods, rather they naively do both in parallel.
In this work, we propose a novel framework, which we call Hybrid Associative Memories (HAM), which explicitly uses the RNN state and the KV cache in a complementary way. Specifically, we combine RNNs and self-attention so that the RNN summarizes the contents of a sequence, while the KV cache only stores the parts of the sequence that the RNN cannot predict. Effectively, we can think of the KV cache as a notebook or cheat-sheet in which the model selects and stores specific, precise details whereas the RNN can capture the general meaning of the sequence.
This expression makes the growing cost of self-attention explicit: to compute the current output, the model must retain keys and values for all past positions.
and the output is given by
This update rule is derived from an optimization problem where the state update is optimized to reduce an online prediction error.
We illustrate the routing of the surprising tokens to the KV cache using an example from the “Needle in a Haystack” (NIAH) task where a needle is hidden in between a longer sequence. As we can see the routing metric correctly spikes at the needle and these tokens are routed to the KV cache. Natural language is messier and more complex and we study the behavior of the routing for these sequences in the technical paper.


We compare HAM against parameter-matched baselines at the 800M scale: a standard Transformer, a pure Gated DeltaNet (GDN), and a stacked hybrid that interleaves GDN with global self-attention layers (GDN-GSA). All HAM variants use only 50% of the KV cache relative to the Transformer and are memory- and compute-matched to the GDN-GSA (but not the GDN or Transformer). Models are trained on 50B tokens from the Long Data Collections dataset with a context length of 16,384. We note that the Transformer baseline in our test has a larger dimension for the KV cache compared to GDN-GSA and HAM.
On the standard language-modeling and commonsense suite, the learned-router HAM variants achieve the best overall average score reported in the table: 49.1, versus 48.8 for the Transformer baseline and 48.4 for the GDN-GSA hybrid. The learned router posts the strongest ARC-e score in the table at 50.8, while the EDA variant reaches the best BoolQ score at 59.7 and the best WikiText perplexity at 19.49. This shows that HAM is competitive on ordinary next-token modeling and zero-shot reasoning tasks.
The long-context results are more nuanced: HAM is not uniformly best at all RULER tasks; however, at a matched 50% cache budget, HAM is especially strong on the retrieval settings where interference might matter most: the multikey (MK), multiquery (MQ) and multivalue (MV) tasks. The learned-router model reaches 67/45/19 on MK2 at 4k/8k/16k, substantially above the Transformer’s 28/37/14 and far above the GDN-GSA hybrid’s 15/12/2.6. The EDA variant also recovers very strong single-2 performance. In the paper, we also show how these results vary as we tune the fraction of tokens routed to the KV cache.

In Figure 3, we see that as the KV fraction increases, loss decreases in a smooth, predictable way, instead of a sharp cliff; long-context eval scores show similar trends up to a point, but we observe a slight decrease for larger KV cache sizes, where interference may be greater.
Figure 7 shows that the global KV cache target is reached rapidly, and that different layers have very different KV cache usage under the global target. HAM offers the possibility of probing or setting the KV cache capacity of different layers in a continuous and precise manner.
This level of fine-grained, continuous control over KV cache size has been difficult to achieve in prior architectures. Stacked hybrids can only reduce KV usage by removing attention layers—a discrete, coarse-grained choice made before training. Hybrid-head models can use sliding-window attention, but these are static decisions. Post-hoc KV cache compression methods (SnapKV, PyramidKV, Ada-KV, etc.) provide budget control at inference time, but the model was trained with a full cache and has no complementary memory to absorb evicted tokens. In HAM, the RNN explicitly captures compressible context, ensuring that eviction and compression are coordinated and learned end-to-end from the start of training.

HAM introduces a principled way to combine recurrence and attention by exploiting their complementary strengths. Rather than treating the KV cache and RNN as independent sequence-mixers, HAM enables them to work in a complementary fashion: the RNN summarizes the predictable part of the sequence, and the KV cache stores the parts that are surprising.
This approach has many appealing properties:
Competitive performance. At 800M parameters with only 50% KV cache, HAM matches or exceeds full-cache Transformers and layerwise hybrids on standard benchmarks and many long-context tasks.
Fine-grained control of the KV cache budget. The KV cache budget is a continuous, user-specified parameter with a smooth and predictable trade-off with performance—enabling practitioners to select the right operating point for their latency and memory constraints.
Leveraging complementary memory systems theory: HAM’s design aligns with Complementary Learning Systems theory in neuroscience, where distinct subsystems (hippocampus and cortex) handle fast episodic recall and slower abstract integration of experience.
Looking ahead, several directions are promising. The relatively smooth trade-off between KV cache usage and performance suggests that the threshold could be varied during test time, adapting the KV cache budget within a sequence or across tasks. The KV cache growth rate could also be scheduled to achieve sublinear growth for highly structured sequences. Finally, scaling HAM to larger model sizes and longer contexts will be important to understand whether these advantages will be more pronounced on truly long-context scenarios.
Sequence-mixing layers are the critical element in modern language models. They are the only component in the model that lets each token relate to the rest of the sequence. The two dominant paradigms for sequence-mixing layers in modern LLMs are self-attention and modern recurrent neural networks (RNNs) such as Mamba2/3 and GatedDeltaNet (GDN). These two layer types integrate the information contained in a sequence in their internal state using orthogonal philosophies: self-attention stores details about every time-point in the sequence faithfully (the KV cache), like a scribe who notes every word in a conversation, and this results in a KV cache that grows with the sequence length. The RNN on the other hand, compresses the entire sequence into a fixed-size state, and thus, summarizes the contents of the sequence.
The Transformer architecture, which uses self-attention, shows strong performance, but this performance comes at the expense of a large memory and computational cost. During training, the computational cost of attention grows quadratically, and during inference the KV cache size grows linearly with sequence length, rapidly becoming memory-bandwidth bound.
This cost becomes ever more salient in long-context tasks which require deep and complex reasoning. The computational and memory bottleneck of the KV cache is one of the most critical issues in training and deploying modern LLMs.
On the other hand, RNNs are efficient with respect to both memory and computation—the memory required doesn’t depend on the sequence length and the computational cost is linear instead of quadratic. However, because they must compress the full context into a fixed-size state, performance inevitably degrades, especially on long-context tasks which require recalling precise details. The two methods have complementary strengths: one prioritizes precise recall, while the other prioritizes efficient summarization, which can be beneficial for generalization.
The large computational and memory cost of the KV cache has led to a significant body of work on hybrid architectures which combine these methods. The predominant class of hybrids interleaves layers of self-attention with RNNs, with a focus on reducing the computational burden of the self-attention layers, which Zyphra helped pioneer with its Zamba and Zamba2 suite of models.
Another class of hybrids combines an RNN and Transformer in the same layer. However, both these approaches do not exploit the complementary strengths of these two sequence-mixing methods, rather they naively do both in parallel.
In this work, we propose a novel framework, which we call Hybrid Associative Memories (HAM), which explicitly uses the RNN state and the KV cache in a complementary way. Specifically, we combine RNNs and self-attention so that the RNN summarizes the contents of a sequence, while the KV cache only stores the parts of the sequence that the RNN cannot predict. Effectively, we can think of the KV cache as a notebook or cheat-sheet in which the model selects and stores specific, precise details whereas the RNN can capture the general meaning of the sequence.

This expression makes the growing cost of self-attention explicit: to compute the current output, the model must retain keys and values for all past positions.
and the output is given by
This update rule is derived from an optimization problem where the state update is optimized to reduce an online prediction error.
We illustrate the routing of the surprising tokens to the KV cache using an example from the “Needle in a Haystack” (NIAH) task where a needle is hidden in between a longer sequence. As we can see the routing metric correctly spikes at the needle and these tokens are routed to the KV cache. Natural language is messier and more complex and we study the behavior of the routing for these sequences in the technical paper.




Sequence-mixing layers are the critical element in modern language models. They are the only component in the model that lets each token relate to the rest of the sequence. The two dominant paradigms for sequence-mixing layers in modern LLMs are self-attention and modern recurrent neural networks (RNNs) such as Mamba2/3 and GatedDeltaNet (GDN). These two layer types integrate the information contained in a sequence in their internal state using orthogonal philosophies: self-attention stores details about every time-point in the sequence faithfully (the KV cache), like a scribe who notes every word in a conversation, and this results in a KV cache that grows with the sequence length. The RNN on the other hand, compresses the entire sequence into a fixed-size state, and thus, summarizes the contents of the sequence.
The Transformer architecture, which uses self-attention, shows strong performance, but this performance comes at the expense of a large memory and computational cost. During training, the computational cost of attention grows quadratically, and during inference the KV cache size grows linearly with sequence length, rapidly becoming memory-bandwidth bound.
This cost becomes ever more salient in long-context tasks which require deep and complex reasoning. The computational and memory bottleneck of the KV cache is one of the most critical issues in training and deploying modern LLMs.
On the other hand, RNNs are efficient with respect to both memory and computation—the memory required doesn’t depend on the sequence length and the computational cost is linear instead of quadratic. However, because they must compress the full context into a fixed-size state, performance inevitably degrades, especially on long-context tasks which require recalling precise details. The two methods have complementary strengths: one prioritizes precise recall, while the other prioritizes efficient summarization, which can be beneficial for generalization.
The large computational and memory cost of the KV cache has led to a significant body of work on hybrid architectures which combine these methods. The predominant class of hybrids interleaves layers of self-attention with RNNs, with a focus on reducing the computational burden of the self-attention layers, which Zyphra helped pioneer with its Zamba and Zamba2 suite of models.
Another class of hybrids combines an RNN and Transformer in the same layer. However, both these approaches do not exploit the complementary strengths of these two sequence-mixing methods, rather they naively do both in parallel.
In this work, we propose a novel framework, which we call Hybrid Associative Memories (HAM), which explicitly uses the RNN state and the KV cache in a complementary way. Specifically, we combine RNNs and self-attention so that the RNN summarizes the contents of a sequence, while the KV cache only stores the parts of the sequence that the RNN cannot predict. Effectively, we can think of the KV cache as a notebook or cheat-sheet in which the model selects and stores specific, precise details whereas the RNN can capture the general meaning of the sequence.

This expression makes the growing cost of self-attention explicit: to compute the current output, the model must retain keys and values for all past positions.
and the output is given by
This update rule is derived from an optimization problem where the state update is optimized to reduce an online prediction error.
We illustrate the routing of the surprising tokens to the KV cache using an example from the “Needle in a Haystack” (NIAH) task where a needle is hidden in between a longer sequence. As we can see the routing metric correctly spikes at the needle and these tokens are routed to the KV cache. Natural language is messier and more complex and we study the behavior of the routing for these sequences in the technical paper.


We compare HAM against parameter-matched baselines at the 800M scale: a standard Transformer, a pure Gated DeltaNet (GDN), and a stacked hybrid that interleaves GDN with global self-attention layers (GDN-GSA). All HAM variants use only 50% of the KV cache relative to the Transformer and are memory- and compute-matched to the GDN-GSA (but not the GDN or Transformer). Models are trained on 50B tokens from the Long Data Collections dataset with a context length of 16,384. We note that the Transformer baseline in our test has a larger dimension for the KV cache compared to GDN-GSA and HAM.

On the standard language-modeling and commonsense suite, the learned-router HAM variants achieve the best overall average score reported in the table: 49.1, versus 48.8 for the Transformer baseline and 48.4 for the GDN-GSA hybrid. The learned router posts the strongest ARC-e score in the table at 50.8, while the EDA variant reaches the best BoolQ score at 59.7 and the best WikiText perplexity at 19.49. This shows that HAM is competitive on ordinary next-token modeling and zero-shot reasoning tasks.
The long-context results are more nuanced: HAM is not uniformly best at all RULER tasks; however, at a matched 50% cache budget, HAM is especially strong on the retrieval settings where interference might matter most: the multikey (MK), multiquery (MQ) and multivalue (MV) tasks. The learned-router model reaches 67/45/19 on MK2 at 4k/8k/16k, substantially above the Transformer’s 28/37/14 and far above the GDN-GSA hybrid’s 15/12/2.6. The EDA variant also recovers very strong single-2 performance. In the paper, we also show how these results vary as we tune the fraction of tokens routed to the KV cache.

In Figure 3, we see that as the KV fraction increases, loss decreases in a smooth, predictable way, instead of a sharp cliff; long-context eval scores show similar trends up to a point, but we observe a slight decrease for larger KV cache sizes, where interference may be greater.
Figure 7 shows that the global KV cache target is reached rapidly, and that different layers have very different KV cache usage under the global target. HAM offers the possibility of probing or setting the KV cache capacity of different layers in a continuous and precise manner.
This level of fine-grained, continuous control over KV cache size has been difficult to achieve in prior architectures. Stacked hybrids can only reduce KV usage by removing attention layers—a discrete, coarse-grained choice made before training. Hybrid-head models can use sliding-window attention, but these are static decisions. Post-hoc KV cache compression methods (SnapKV, PyramidKV, Ada-KV, etc.) provide budget control at inference time, but the model was trained with a full cache and has no complementary memory to absorb evicted tokens. In HAM, the RNN explicitly captures compressible context, ensuring that eviction and compression are coordinated and learned end-to-end from the start of training.
The large computational and memory cost of the KV cache has led to a significant body of work on hybrid architectures that combine these two methods. The predominant class of hybrids interleaves self-attention layers with RNN layers, with a focus on reducing the computational burden of the self-attention layers, an approach Zyphra helped pioneer with its Zamba and Zamba2 suites of models.
Another class of hybrids combines an RNN and attention within the same layer. However, neither approach exploits the complementary strengths of the two sequence-mixing methods; rather, they naively run both in parallel.
In this work, we propose a novel framework, which we call Hybrid Associative Memories (HAM), which explicitly uses the RNN state and the KV cache in a complementary way. Specifically, we combine RNNs and self-attention so that the RNN summarizes the contents of a sequence, while the KV cache only stores the parts of the sequence that the RNN cannot predict. Effectively, we can think of the KV cache as a notebook or cheat-sheet in which the model selects and stores specific, precise details whereas the RNN can capture the general meaning of the sequence.
For self-attention, the output at position $t$ is a weighted sum over all previous values,
$$y_t = \sum_{i=1}^{t} \frac{\exp(q_t^\top k_i)}{\sum_{j=1}^{t} \exp(q_t^\top k_j)}\, v_i.$$
This expression makes the growing cost of self-attention explicit: to compute the current output, the model must retain keys and values for all past positions.
A gated delta-rule RNN such as GDN instead maintains a fixed-size matrix state $S_t$, updated as
$$S_t = S_{t-1}\,\alpha_t\left(I - \beta_t k_t k_t^\top\right) + \beta_t v_t k_t^\top,$$
and the output is given by
$$y_t = S_t q_t.$$
This update rule is derived from an optimization problem in which the state update is chosen to reduce an online prediction error of the form $\lVert v_t - S_{t-1} k_t \rVert^2$.
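The update rule above can be sketched in a few lines of NumPy. This is a toy, single-head sketch under simplifying assumptions: in the real layer, `alpha` and `beta` are input-dependent gates and the computation is batched and chunked for hardware efficiency.

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One step of the gated delta rule (toy, single-head sketch).

    S: (d_v, d_k) state matrix; k, q: (d_k,); v: (d_v,);
    alpha: decay gate in (0, 1]; beta: write strength in (0, 1].
    """
    # erase the state's current prediction along k, decay, then write v
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    y = S @ q                     # output: read the state with the query
    return S, y

# With a unit-norm key and alpha = beta = 1, a single step stores the
# association k -> v exactly, so querying with k recalls v.
rng = np.random.default_rng(0)
k = rng.standard_normal(4); k /= np.linalg.norm(k)
v = rng.standard_normal(4)
S0 = np.zeros((4, 4))
S1, y = gated_delta_step(S0, k, v, q=k, alpha=1.0, beta=1.0)
print(np.allclose(y, v))  # True
```

The erase-then-write structure is what makes this a delta rule: the write is proportional to the prediction error, so already-predicted content is not stored twice.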
We illustrate the routing of surprising tokens to the KV cache using an example from the “Needle in a Haystack” (NIAH) task, where a needle is hidden within a longer sequence. As we can see, the routing metric correctly spikes at the needle, and these tokens are routed to the KV cache. Natural language is messier and more complex, and we study the behavior of the routing on such sequences in the technical paper.
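A cartoon of this routing signal can be reproduced with synthetic data. This is purely illustrative: the distance from a running mean stands in for the RNN's prediction error, and the 5% budget is an arbitrary choice, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 200, 16
base = rng.standard_normal(d)
# "haystack": small perturbations of one predictable direction
tokens = base + 0.1 * rng.standard_normal((T, d))
tokens[120] = -5.0 * base          # the "needle": an out-of-distribution token

# surprise = distance from a running mean, standing in for RNN prediction error
mean = np.zeros(d)
surprise = np.empty(T)
for t in range(T):
    surprise[t] = np.linalg.norm(tokens[t] - mean)
    mean = 0.9 * mean + 0.1 * tokens[t]

budget = int(0.05 * T)                     # cache at most 5% of tokens
routed = np.argsort(surprise)[-budget:]    # most surprising tokens win the slots
print(120 in routed)  # True: the needle is routed to the KV cache
```

The predictable haystack tokens produce a low, flat surprise signal that the RNN absorbs, while the needle spikes above the threshold and lands in the cache.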

We compare HAM against parameter-matched baselines at the 800M scale: a standard Transformer, a pure Gated DeltaNet (GDN), and a stacked hybrid that interleaves GDN with global self-attention layers (GDN-GSA). All HAM variants use only 50% of the KV cache relative to the Transformer and are memory- and compute-matched to the GDN-GSA (but not the GDN or Transformer). Models are trained on 50B tokens from the Long Data Collections dataset with a context length of 16,384. We note that the Transformer baseline in our test has a larger dimension for the KV cache compared to GDN-GSA and HAM.
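For intuition on what a 50% budget buys, here is the standard KV-cache size arithmetic. The dimensions below are illustrative numbers for a hypothetical 800M-scale attention config, not the paper's actual ones.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Total KV cache per sequence: keys + values, bf16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# hypothetical 800M-scale attention config at the 16,384 training context
full = kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=64, seq_len=16_384)
print(full // 2**20)        # 768 (MiB) for a full cache
print(full // 2 // 2**20)   # 384 (MiB) at HAM's 50% budget
```

Because the cache grows linearly with `seq_len`, the absolute savings from a fractional budget grow with context length, which is where inference becomes memory-bandwidth bound.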


On the standard language-modeling and commonsense suite, the learned-router HAM variant achieves the best overall average score reported in the table: 49.1, versus 48.8 for the Transformer baseline and 48.4 for the GDN-GSA hybrid. The learned router also posts the strongest ARC-e score in the table at 50.8, while the EDA variant reaches the best BoolQ score at 59.7 and the best WikiText perplexity at 19.49. HAM is thus competitive on ordinary next-token modeling and zero-shot reasoning tasks.
The long-context results are more nuanced: HAM is not uniformly best across RULER tasks. However, at a matched 50% cache budget, HAM is especially strong on the retrieval settings where interference should matter most: the multi-key (MK), multi-query (MQ), and multi-value (MV) tasks. The learned-router model reaches 67/45/19 on MK2 at 4k/8k/16k context, substantially above the Transformer’s 28/37/14 and far above the GDN-GSA hybrid’s 15/12/2.6. The EDA variant also recovers very strong single-2 performance. In the paper, we also show how these results vary as we tune the fraction of tokens routed to the KV cache.
In Figure 3, we see that as the KV fraction increases, loss decreases smoothly and predictably rather than falling off a cliff; long-context eval scores show a similar trend up to a point, but we observe a slight decrease at larger KV cache sizes, where interference may be greater.
Figure 7 shows that the global KV cache target is reached rapidly, and that different layers settle at very different KV cache usage under this global target. HAM thus makes it possible to probe or set the KV cache capacity of individual layers in a continuous and precise manner.
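One way to see why per-layer usage differs under a global target: a single threshold applied to every layer's surprise scores hits the overall budget, but each layer's score distribution crosses that threshold at a different rate. A toy sketch with synthetic scores (the gamma distributions and scales are arbitrary stand-ins, not the model's actual statistics):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy surprise scores for 4 layers with different scales
layer_scores = [rng.gamma(shape=2.0, scale=s, size=10_000)
                for s in (0.5, 1.0, 2.0, 4.0)]

target = 0.5                                # global fraction of tokens to cache
pooled = np.concatenate(layer_scores)
tau = np.quantile(pooled, 1.0 - target)     # one global threshold

per_layer = [float(np.mean(s > tau)) for s in layer_scores]
overall = float(np.mean(pooled > tau))
# overall lands near the target; per-layer fractions spread widely around it
print(round(overall, 3), [round(f, 2) for f in per_layer])
```

Layers whose scores concentrate below the threshold cache almost nothing, while high-surprise layers use far more than their proportional share, matching the heterogeneous per-layer usage seen in Figure 7.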
This level of fine-grained, continuous control over KV cache size has been difficult to achieve in prior architectures. Stacked hybrids can only reduce KV usage by removing attention layers—a discrete, coarse-grained choice made before training. Hybrid-head models can use sliding-window attention, but these are static decisions. Post-hoc KV cache compression methods (SnapKV, PyramidKV, Ada-KV, etc.) provide budget control at inference time, but the model was trained with a full cache and has no complementary memory to absorb evicted tokens. In HAM, the RNN explicitly captures compressible context, ensuring that eviction and compression are coordinated and learned end-to-end from the start of training.
HAM introduces a principled way to combine recurrence and attention by exploiting their complementary strengths. Rather than treating the KV cache and the RNN state as independent sequence mixers, HAM has them divide the work: the RNN summarizes the predictable part of the sequence, and the KV cache stores the parts that are surprising.
This approach has many appealing properties:
Competitive performance. At 800M parameters with only 50% KV cache, HAM matches or exceeds full-cache Transformers and layerwise hybrids on standard benchmarks and many long-context tasks.
Fine-grained control of the KV cache budget. The KV cache budget is a continuous, user-specified parameter with a smooth and predictable trade-off with performance—enabling practitioners to select the right operating point for their latency and memory constraints.
Leveraging complementary memory systems theory. HAM’s design aligns with Complementary Learning Systems theory in neuroscience, where distinct subsystems (hippocampus and cortex) handle fast episodic recall and slower abstract integration of experience.
Looking ahead, several directions are promising. The relatively smooth trade-off between KV cache usage and performance suggests that the threshold could be varied at test time, adapting the KV cache budget within a sequence or across tasks. The KV cache growth rate could also be scheduled to achieve sublinear growth for highly structured sequences. Finally, scaling HAM to larger model sizes and longer contexts will be important for understanding whether these advantages become more pronounced in truly long-context scenarios.

We present histograms depicting the distribution of cluster sizes in all the datasets (see Fig. 7-11). Please note that all the figures are in log-log scale. We see a significant drop in the number of clusters starting from a size of around 100. This drop is present both in DCLM and FineWeb-Edu2 (see Fig. 8 and 9 respectively), and is most likely explained by a combination of the deduplication strategy and quality filtering used when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low-quality material (repeated advertisements, license-agreement templates, etc.), so it is not surprising that such documents were removed. Notably, DCLM still contained one cluster of size close to 1 million documents, containing low-quality documents seemingly coming from advertisements (see Appendix). We find that both Zyda-1 and Dolma-CC contain a small number of duplicates, which is expected, since both datasets were deduplicated globally by their authors. The remaining duplicates are likely false negatives from the initial deduplication procedure. Note that the distributions of duplicate cluster sizes for these two datasets (Fig. 10 and 11) do not contain any sharp drops, but rather decrease hyper-exponentially with cluster size.
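For reference, a log-log cluster-size histogram of this kind can be produced with log-spaced bins. This is a generic sketch on synthetic heavy-tailed data, not the actual datasets above.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic cluster sizes with a heavy tail, standing in for duplicate clusters
sizes = rng.zipf(a=2.0, size=100_000)

# log-spaced bins so bars are evenly spaced on a log-log plot
bins = np.logspace(0, np.log10(sizes.max()) + 0.1, num=30)
counts, edges = np.histogram(sizes, bins=bins)

# with matplotlib one would then call: plt.loglog(edges[:-1], counts, marker="o")
print(counts.sum() == sizes.size)  # True: every cluster falls in some bin
```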




Below is an example of a document from the largest cluster (~1M documents) of duplicates in DCLM (quality score 0.482627):
Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is SafeSafe score: 1
The higher the number, the more dangerous the website.Any number higher than 1 means DANGER.
Positive votes:
Negative votes:
Vote Up Vote Down review
Have you had bad experience with Warn us, please!
Below are a few documents with different quality scores from DCLM coming from the same duplicate cluster. The quality score varies from ~0.2 to ~0.04.
This level of fine-grained, continuous control over KV cache size has been difficult to achieve in prior architectures. Stacked hybrids can only reduce KV usage by removing attention layers—a discrete, coarse-grained choice made before training. Hybrid-head models can use sliding-window attention, but these are static decisions. Post-hoc KV cache compression methods (SnapKV, PyramidKV, Ada-KV, etc.) provide budget control at inference time, but the model was trained with a full cache and has no complementary memory to absorb evicted tokens. In HAM, the RNN explicitly captures compressible context, ensuring that eviction and compression are coordinated and learned end-to-end from the start of training.
Sequence-mixing layers are the critical element in modern language models. They are the only component in the model that lets each token relate to the rest of the sequence. The two dominant paradigms for sequence-mixing layers in modern LLMs are self-attention and modern recurrent neural networks (RNNs) such as Mamba2/3 and GatedDeltaNet (GDN). These two layer types integrate the information contained in a sequence in their internal state using orthogonal philosophies: self-attention stores details about every time-point in the sequence faithfully (the KV cache), like a scribe who notes every word in a conversation, and this results in a KV cache that grows with the sequence length. The RNN on the other hand, compresses the entire sequence into a fixed-size state, and thus, summarizes the contents of the sequence.
The Transformer architecture, which uses self-attention, shows strong performance, but this performance comes at the expense of a large memory and computational cost. During training, the computational cost of attention grows quadratically, and during inference the KV cache size grows linearly with sequence length, rapidly becoming memory-bandwidth bound.
This cost becomes ever more salient in long-context tasks which require deep and complex reasoning. The computational and memory bottleneck of the KV cache is one of the most critical issues in training and deploying modern LLMs.
On the other hand, RNNs are efficient with respect to both memory and computation—the memory required doesn’t depend on the sequence length and the computational cost is linear instead of quadratic. However, because they must compress the full context into a fixed-size state, performance inevitably degrades, especially on long-context tasks which require recalling precise details. The two methods have complementary strengths: one prioritizes precise recall, while the other prioritizes efficient summarization, which can be beneficial for generalization.
The large computational and memory cost of the KV cache has led to a significant body of work on hybrid architectures which combine these methods. The predominant class of hybrids interleaves layers of self-attention with RNNs, with a focus on reducing the computational burden of the self-attention layers, which Zyphra helped pioneer with its Zamba and Zamba2 suite of models.
Another class of hybrids combines an RNN and Transformer in the same layer. However, neither approach exploits the complementary strengths of the two sequence-mixing methods; both simply run the two in parallel.
In this work, we propose a novel framework, which we call Hybrid Associative Memories (HAM), that explicitly uses the RNN state and the KV cache in a complementary way. Specifically, we combine RNNs and self-attention so that the RNN summarizes the contents of a sequence, while the KV cache stores only the parts of the sequence that the RNN cannot predict. Effectively, the KV cache acts as a notebook or cheat-sheet in which the model selects and stores specific, precise details, while the RNN captures the general meaning of the sequence.

At position $t$, the output of causal self-attention can be written as

$$y_t \;=\; \sum_{i=1}^{t} \frac{\exp\!\left(q_t^{\top} k_i\right)}{\sum_{j=1}^{t} \exp\!\left(q_t^{\top} k_j\right)}\, v_i.$$

This expression makes the growing cost of self-attention explicit: to compute the current output, the model must retain keys and values for all past positions.
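This growth is easy to see in a minimal NumPy sketch of single-head causal decoding (illustrative names and no batching or kernel tricks; this is not the production implementation):

```python
import numpy as np

def attend_step(q_t, cache, k_t, v_t):
    # Append this token's key/value: the cache grows by one entry per step,
    # so memory is O(t) and each decode step touches every past position.
    cache["K"].append(k_t)
    cache["V"].append(v_t)
    K = np.stack(cache["K"])                 # (t, d)
    V = np.stack(cache["V"])                 # (t, d)
    scores = K @ q_t / np.sqrt(len(q_t))     # (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # softmax over all past positions
    return w @ V                             # (d,) attention output

rng = np.random.default_rng(0)
d, cache = 4, {"K": [], "V": []}
for _ in range(10):
    y = attend_step(rng.normal(size=d), cache, rng.normal(size=d), rng.normal(size=d))
assert len(cache["K"]) == 10                 # cache size grows linearly with length
```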
A gated delta-rule RNN such as GDN instead maintains a fixed-size matrix state $S_t$, updated as

$$S_t \;=\; S_{t-1}\!\left(\alpha_t \left(I - \beta_t k_t k_t^{\top}\right)\right) + \beta_t v_t k_t^{\top},$$

and the output is given by

$$y_t \;=\; S_t q_t.$$

This update rule is derived from an optimization problem in which the state update is optimized to reduce an online prediction error.
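A single step of this recurrence can be sketched in NumPy (single head, unit-norm keys, illustrative names; the fused kernels used in practice look quite different):

```python
import numpy as np

def gdn_step(S, k, v, q, alpha, beta):
    # Decay the state, erase the old association along k, and write the new
    # (v, k) pair: S stays a fixed (d, d) matrix however long the sequence is.
    d = len(k)
    k = k / np.linalg.norm(k)                # unit-norm key
    S = S @ (alpha * (np.eye(d) - beta * np.outer(k, k))) + beta * np.outer(v, k)
    y = S @ q                                # read the state with the query
    return S, y

rng = np.random.default_rng(0)
d = 4
S = np.zeros((d, d))
for _ in range(100):
    S, y = gdn_step(S, rng.normal(size=d), rng.normal(size=d),
                    rng.normal(size=d), alpha=0.95, beta=0.5)
assert S.shape == (d, d)                     # state size is constant in t
```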
We illustrate the routing of surprising tokens to the KV cache using an example from the “Needle in a Haystack” (NIAH) task, where a needle is hidden within a longer sequence. The routing metric correctly spikes at the needle, and these tokens are routed to the KV cache. Natural language is messier and more complex; we study the routing behavior on such sequences in the technical paper.
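The idea behind a surprise-based routing metric can be sketched as follows (illustrative only; the learned and EDA routers actually used by HAM are defined in the paper):

```python
import numpy as np

def route_by_surprise(keys, values, kv_fraction=0.5):
    # Run a plain delta-rule RNN over the sequence and measure, per token,
    # how badly the current state predicts the value from the key. Tokens
    # the RNN cannot predict (high error) are routed to the KV cache.
    n, d = keys.shape
    S = np.zeros((d, d))
    surprise = []
    for k, v in zip(keys, values):
        k = k / np.linalg.norm(k)
        surprise.append(np.linalg.norm(v - S @ k))             # online prediction error
        S = S @ (np.eye(d) - np.outer(k, k)) + np.outer(v, k)  # delta-rule write
    n_keep = int(kv_fraction * n)
    return np.sort(np.argsort(surprise)[-n_keep:])             # most surprising tokens

# A repeated "haystack" pair plus one needle at position 7. The needle
# (and the very first token, which the empty state cannot predict) is
# maximally surprising and lands in the cache.
d = 8
e = np.eye(d)
keys = np.tile(e[0], (10, 1)); values = np.tile(e[1], (10, 1))
keys[7], values[7] = e[2], e[3]
kept = route_by_surprise(keys, values, kv_fraction=0.2)
assert list(kept) == [0, 7]
```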

We compare HAM against parameter-matched baselines at the 800M scale: a standard Transformer, a pure Gated DeltaNet (GDN), and a stacked hybrid that interleaves GDN with global self-attention layers (GDN-GSA). All HAM variants use only 50% of the KV cache relative to the Transformer and are memory- and compute-matched to GDN-GSA (but not to GDN or the Transformer). Models are trained on 50B tokens from the Long Data Collections dataset with a context length of 16,384. We note that the Transformer baseline in our test has a larger KV-cache dimension than GDN-GSA and HAM.

On the standard language-modeling and commonsense suite, the learned-router HAM variants achieve the best overall average score reported in the table: 49.1, versus 48.8 for the Transformer baseline and 48.4 for the GDN-GSA hybrid. The learned router posts the strongest ARC-e score in the table at 50.8, while the EDA variant reaches the best BoolQ score at 59.7 and the best WikiText perplexity at 19.49. This shows that HAM is competitive on ordinary next-token modeling and zero-shot reasoning tasks.
The long-context results are more nuanced: HAM is not uniformly best at all RULER tasks; however, at a matched 50% cache budget, HAM is especially strong on the retrieval settings where interference might matter most: the multikey (MK), multiquery (MQ) and multivalue (MV) tasks. The learned-router model reaches 67/45/19 on MK2 at 4k/8k/16k, substantially above the Transformer’s 28/37/14 and far above the GDN-GSA hybrid’s 15/12/2.6. The EDA variant also recovers very strong single-2 performance. In the paper, we also show how these results vary as we tune the fraction of tokens routed to the KV cache.
In Figure 3, we see that as the KV fraction increases, loss decreases in a smooth, predictable way, instead of a sharp cliff; long-context eval scores show similar trends up to a point, but we observe a slight decrease for larger KV cache sizes, where interference may be greater.
Figure 7 shows that the global KV cache target is reached rapidly, and that different layers have very different KV cache usage under the global target. HAM offers the possibility of probing or setting the KV cache capacity of different layers in a continuous and precise manner.
This level of fine-grained, continuous control over KV cache size has been difficult to achieve in prior architectures. Stacked hybrids can only reduce KV usage by removing attention layers—a discrete, coarse-grained choice made before training. Hybrid-head models can use sliding-window attention, but these are static decisions. Post-hoc KV cache compression methods (SnapKV, PyramidKV, Ada-KV, etc.) provide budget control at inference time, but the model was trained with a full cache and has no complementary memory to absorb evicted tokens. In HAM, the RNN explicitly captures compressible context, ensuring that eviction and compression are coordinated and learned end-to-end from the start of training.
HAM introduces a principled way to combine recurrence and attention by exploiting their complementary strengths. Rather than treating the KV cache and RNN as independent sequence-mixers, HAM enables them to work in a complementary fashion: the RNN summarizes the predictable part of the sequence, and the KV cache stores the parts that are surprising.
This approach has many appealing properties:
Competitive performance. At 800M parameters with only 50% KV cache, HAM matches or exceeds full-cache Transformers and layerwise hybrids on standard benchmarks and many long-context tasks.
Fine-grained control of the KV cache budget. The KV cache budget is a continuous, user-specified parameter with a smooth and predictable trade-off with performance—enabling practitioners to select the right operating point for their latency and memory constraints.
Leveraging complementary memory systems theory. HAM’s design aligns with Complementary Learning Systems theory in neuroscience, where distinct subsystems (the hippocampus and cortex) handle fast episodic recall and slower, more abstract integration of experience.
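To make the second property concrete, here is one simple way a continuous budget could be realized (a hedged sketch under our own assumptions, not the paper's mechanism): choose the routing threshold as a quantile of the surprise scores so that roughly the target fraction of tokens exceeds it.

```python
import numpy as np

def threshold_for_budget(surprise_scores, kv_fraction):
    """Pick a routing threshold so that approximately `kv_fraction`
    of tokens score above it and are routed to the KV cache.
    Illustrative only; names and mechanism are our assumptions."""
    return np.quantile(surprise_scores, 1.0 - kv_fraction)

# Hypothetical surprise scores for eight tokens.
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6])
thr = threshold_for_budget(scores, kv_fraction=0.5)
kept = float((scores > thr).mean())  # fraction of tokens routed to the cache
```

Because the budget enters only through this continuous quantile, any operating point between a pure RNN (fraction 0) and a full cache (fraction 1) is available.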
Looking ahead, several directions are promising. The relatively smooth trade-off between KV cache usage and performance suggests that the routing threshold could be varied at test time, adapting the KV cache budget within a sequence or across tasks. The KV cache growth rate could also be scheduled to achieve sublinear growth on highly structured sequences. Finally, scaling HAM to larger model sizes and longer contexts will be important for understanding whether these advantages become more pronounced in truly long-context scenarios.