We demonstrate a retrieval system that extends any off-the-shelf LLM to a 1-billion-token (1B) context on a standard CPU at inference time. In this post, we share results showing that our approach, a novel retrieval method based on sparse graphs, achieves SoTA performance on the Hash-Hop benchmark, which requires reasoning over elements scattered across an ultra-long context.
Our system performs well up to 1 billion tokens of context (and beyond), is more compute- and memory-efficient than common RAG systems that use dense embeddings, and is more efficient than long-context transformer-based LLMs.
These preliminary results suggest our algorithm is a promising approach for long-context tasks, especially in compute-constrained scenarios (on-device, cost-effective on-prem and cloud deployments, etc.).
An important and open problem in AI is building LLMs that can effectively use long contexts. Long contexts allow LLMs to learn from large datasets at inference time via in-context learning, rather than during an expensive pre-training phase that tunes parameters.
Building effective long-context LLMs could, for example, allow for LLMs that can summarize, reason over, and answer questions about vast inputs such as code bases or books without the need to alter the model’s parameters.
Current state-of-the-art transformer-based LLMs with long contexts (e.g., Google’s Gemini-1.5 at 1M tokens) often perform well on long-context benchmarks, but have the significant drawback of requiring enormous compute, since they use attention layers whose compute cost increases quadratically with context length. This cost limits scalability and deployment flexibility where compute is limited (on-device, cost-constrained on-prem, etc.).
State space models (SSMs) offer a more computationally efficient alternative to transformers, but are known to struggle on certain important categories of tasks (e.g., recall and in-context learning) relative to transformers. At Zyphra, we have also been addressing this challenge with hybrid SSM-transformer models, which achieve most of the speedup of pure SSMs while counteracting their weaknesses with a few carefully selected attention layers. However, SSM hybrids still retain some attention and hence some quadratic cost. In addition, for extremely long contexts, even the linear forward pass per token of a large model can be computationally prohibitive.
A new alternative model, called the Long-Term-Memory (LTM) model, recently presented by the start-up Magic, is purportedly 1000x cheaper than transformers with naive attention and can process very long contexts on a single H100 GPU. The LTM model was shown to perform well on a new long-context benchmark, “Hash-Hop”, on up to 100 million tokens. Though promising, the LTM model is not open-sourced or accessible and is said to require a full rewrite of the inference and training stack.
At Zyphra, we are exploring RAG as an effective, efficient, and simple alternative for long-context tasks. In this post, we extend our prior work and present a RAG system that can solve the Hash-Hop task for inputs of at least 1 billion tokens. Our RAG system feeds only a small subset of the long context into the LLM generator, making language generation highly compute- and memory-efficient. Further, we go beyond prior applications of RAG to long-context tasks by focusing on generating vector databases for retrieval quickly at test time, while simultaneously handling complex reasoning tasks, like Hash-Hop, through a novel, sparse graph-based retrieval method.
Sparse text embeddings allow us to generate embeddings quickly and to run a variant of our previous personalized PageRank (PPR) graph retriever entirely on the CPU. The result is a RAG system that uses very little memory, is highly compute-efficient, and outperforms Magic’s LTM on the Hash-Hop benchmark.
These preliminary results suggest our algorithm is a promising approach for performing long-context tasks, especially in compute-constrained scenarios. Because our algorithm is so computationally efficient, it can process effectively unlimited contexts, and we demonstrate strong performance up to 1B tokens of context – a 1000x increase over current leading long-context systems such as Google’s Gemini.
When attempting to apply RAG to long-context tasks, there are two central challenges: the vector database must be built from the context quickly at inference time rather than offline ahead of time, and the retriever must surface not only chunks directly related to the query but also chunks related to it only indirectly, as multi-hop tasks like Hash-Hop require.
To address these challenges, we propose combining sparse embedding methods with graph-based retrieval: sparse embeddings are cheap enough to generate over the entire context at test time on a CPU, and a graph built over the embedded chunks lets a personalized PageRank retriever follow chains of indirectly related items instead of stopping at direct matches.
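As a rough illustration of the core primitive (not our full retriever), here is a minimal sketch of personalized PageRank run with SciPy sparse linear algebra on the CPU over a chunk graph; the graph construction, damping factor, and iteration count below are simplified placeholders rather than our actual settings.

```python
import numpy as np
from scipy import sparse

def personalized_pagerank(adj: sparse.csr_matrix, seed: np.ndarray,
                          alpha: float = 0.85, iters: int = 50) -> np.ndarray:
    """Power iteration for personalized PageRank over a sparse chunk graph.

    adj[i, j] > 0 means chunks i and j are linked (e.g., they share keywords);
    seed is a probability vector concentrated on chunks that match the query.
    """
    col_sums = np.asarray(adj.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0
    transition = adj @ sparse.diags(1.0 / col_sums)   # column-stochastic transitions
    scores = seed.copy()
    for _ in range(iters):
        # Each step is a sparse matrix-vector product, which stays cheap on a CPU
        # even when the corpus holds millions of chunks.
        scores = alpha * (transition @ scores) + (1 - alpha) * seed
    return scores

# Toy usage: 5 chunks with links 0-1, 1-2 and 3-4; the query matches chunk 0.
rows, cols = [0, 1, 1, 2, 3, 4], [1, 0, 2, 1, 4, 3]
adj = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(5, 5))
seed = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
ranked = personalized_pagerank(adj, seed).argsort()[::-1]
print(ranked[:3])  # chunks 0, 1, 2 outrank the unrelated chunks 3 and 4 (score 0)
```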
We provide evidence for the effectiveness of our approach on the Hash-Hop task, recently introduced by Magic. The Hash-Hop task requires transitive reasoning over synthetic hash assignments of the form ‘Hash1 = Hash2, ..., Hash2 = Hash3, ...’, where the hashes themselves are random strings of characters. The model is presented with a document containing hundreds of thousands or millions of tokens’ worth of hash assignments and a query that gives a starting hash. The model must then respond by listing all hashes that were directly or indirectly assigned to the starting hash. For example, given Hash1 the model must output ‘Hash1=Hash2=Hash3=...’ (see figure 1).
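To make the task concrete, here is a minimal sketch of a Hash-Hop-style instance generator; the hash format, chain length, and number of chains are illustrative placeholders, not Magic’s exact generator.

```python
import random
import string

def random_hash(length: int = 8) -> str:
    # Random alphanumeric string standing in for a "hash".
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

def make_hashhop_instance(num_chains: int = 1000, hops: int = 4):
    """Build a document of shuffled hash assignments plus one query chain.

    Each chain is Hash1 = Hash2, Hash2 = Hash3, ...; the model must return the
    full chain when given only the first hash.
    """
    assignments, chains = [], []
    for _ in range(num_chains):
        chain = [random_hash() for _ in range(hops + 1)]
        chains.append(chain)
        assignments.extend(f"{a} = {b}" for a, b in zip(chain, chain[1:]))
    random.shuffle(assignments)           # assignments are interleaved across chains
    document = "\n".join(assignments)
    query_chain = random.choice(chains)   # query: first hash; answer: whole chain
    return document, query_chain[0], " = ".join(query_chain)

doc, query, answer = make_hashhop_instance(num_chains=5, hops=3)
print(query, "->", answer)
```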
Standard RAG systems will fail at this task (see below): when presented with a query such as ‘Hash1’, standard retrievers retrieve only text directly related to Hash1 and miss the hash assignments that are related to it only transitively.
However, graph-based retrieval methods that retrieve both directly and indirectly related items should be able to solve this. As shown in the figure above, the relations between hash assignments are naturally represented as a graph, which a graph-based retrieval process, like the one we implement, can use to retrieve all the relevant hashes while ignoring those unrelated to the starting hash.
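As a toy illustration of why the graph view suffices (a simplified stand-in, not our actual PPR retriever), parsing the assignments into a hash-to-chunk index and walking it recovers exactly the transitively related chunks:

```python
from collections import defaultdict, deque

def build_hash_graph(chunks: list[str]) -> dict[str, set[int]]:
    """Map each hash to the indices of the assignment chunks that mention it."""
    index = defaultdict(set)
    for i, chunk in enumerate(chunks):
        left, right = (tok.strip() for tok in chunk.split("="))
        index[left].add(i)
        index[right].add(i)
    return index

def transitive_retrieve(chunks, index, start: str) -> list[str]:
    """Follow the chain of assignments reachable from the starting hash."""
    seen_chunks, seen_hashes, frontier = set(), {start}, deque([start])
    while frontier:
        h = frontier.popleft()
        for i in index.get(h, ()):
            if i in seen_chunks:
                continue
            seen_chunks.add(i)
            for nxt in (tok.strip() for tok in chunks[i].split("=")):
                if nxt not in seen_hashes:
                    seen_hashes.add(nxt)
                    frontier.append(nxt)
    return [chunks[i] for i in sorted(seen_chunks)]

chunks = ["A1 = B2", "B2 = C3", "X9 = Y8"]   # A1's chain plus an unrelated assignment
index = build_hash_graph(chunks)
print(transitive_retrieve(chunks, index, "A1"))  # ['A1 = B2', 'B2 = C3']; X9 = Y8 is ignored
```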
We implement and test a standard vector-search model that uses sparse embeddings (std-RAG-Sparse) and our sparse personalized-PageRank-based retriever (PPR-RAG-Sparse). Both RAG systems retrieve only 65 items from memory, which amounts to only several thousand tokens. Both use gpt-4o-mini as the language generator to take in the retrieved text chunks and output the desired hash chain; a small few-shot prompt is provided so the LLM knows the expected output format. We compare the results to Magic’s long-term memory language model (Magic-LTM), as reported in their blog post.
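For concreteness, the generation step might look roughly like the following sketch using the OpenAI Python client; the prompt wording and few-shot example are illustrative placeholders rather than our exact prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT = (
    "Example:\n"
    "Assignments:\naaaa = bbbb\nbbbb = cccc\n"
    "Query: aaaa\n"
    "Answer: aaaa = bbbb = cccc\n"
)

def generate_hash_chain(retrieved_chunks: list[str], query_hash: str) -> str:
    """Feed the ~65 retrieved assignments to the LLM and ask for the full chain."""
    context = "\n".join(retrieved_chunks)
    prompt = (
        f"{FEW_SHOT}\n"
        f"Assignments:\n{context}\n"
        f"Query: {query_hash}\n"
        "Answer:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```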
We see that, as predicted, the base RAG model fails at Hash-Hop tasks that require more than one ‘hop’ of indirect hash assignments to be retrieved. However, our sparse PPR-RAG model performs close to perfectly on the task up to and including 1 billion input tokens, all but matching Magic’s LTM at context lengths up to 16M and outperforming the LTM at lengths greater than 16M. Magic did not test their model on lengths past 100 million tokens, but their results show LTM performance trending downward after 16M tokens. It is reasonable to suspect this downward trend would continue to 1 billion, whereas our model’s performance remains constant.
These results are significant since Magic’s LTM model was likely trained directly to perform this task and was reported to require re-engineering of the ‘full stack’ used in training just to implement and train the model. Our method, on the other hand, requires no such re-engineering, is compute-efficient, requires very little memory, and does not touch the LLM component at all, entirely preserving its language capabilities.
In the figure above, we compare the computational cost of our RAG system to more standard methods that use dense embeddings generated by BERT-style semantic embedders. We show the wall-clock time these systems take both to embed text of various lengths and to retrieve from it.
We test three RAG systems using the following three state-of-the-art dense embedders:
The embedding models are run on an RTX 4090 GPU; text chunks are batched with batch sizes optimized for wall-clock time and fed through the models to obtain the embeddings. We use standard flat indexes from the FAISS library for retrieval from the dense vector databases. We compare these models to the sparse PPR retriever and the sparse base retriever used in the previous section.
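As a minimal sketch of how such a dense retrieval timing can be measured (the corpus size and embedding dimension below are made up, and random vectors stand in for real embeddings):

```python
import time
import faiss
import numpy as np

d, n_chunks, k = 768, 100_000, 65          # embedding dim, corpus size, items retrieved
rng = np.random.default_rng(0)
xb = rng.standard_normal((n_chunks, d), dtype=np.float32)   # stand-in chunk embeddings
xq = rng.standard_normal((1, d), dtype=np.float32)          # stand-in query embedding

index = faiss.IndexFlatIP(d)               # exhaustive inner-product ("flat") search
index.add(xb)

t0 = time.perf_counter()
scores, ids = index.search(xq, k)          # brute-force scan over all chunk embeddings
print(f"flat search over {n_chunks} dense vectors: {time.perf_counter() - t0:.4f}s")
```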
All methods perform retrieval in a small fraction of a second. Importantly, the PPR retriever is fast despite requiring more computation (e.g., numerous matrix multiplies) in its graph retrieval mechanism than the flat vector search of the other methods. Because our method can leverage sparse matrix multiplies on the CPU, it is even slightly faster than FAISS flat vector search over a dense vector database.
However, BERT embedders embed text chunks much more slowly than the sparse embedding algorithm we use, and their embedding time grows faster with input length than the sparse method’s. The sparse embedder, which uses a symbolic program to extract keyword statistics, is very fast even at longer context lengths, requiring only several seconds to generate embeddings for 16+ million input tokens.
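We do not detail our sparse embedder here; as a hedged stand-in with similar properties, a hashed keyword-count vectorizer produces SciPy CSR embeddings on the CPU with no training and cost that grows roughly linearly in input length:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hashed keyword counts: no training, no GPU, output is a sparse CSR matrix.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)

chunks = ["qK3x = Zp9f", "Zp9f = mT2c", "the quick brown fox"]
embeddings = vectorizer.transform(chunks)      # shape: (num_chunks, 2**18), CSR

# Query-to-chunk similarity is a single sparse matrix product on the CPU.
query_vec = vectorizer.transform(["qK3x"])
scores = (embeddings @ query_vec.T).toarray().ravel()
print(scores.argmax())                         # chunk 0, which shares the keyword 'qK3x'
```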
We believe sparse, graph-based RAG models are a promising direction for long-context tasks, especially in compute-constrained scenarios such as on-device applications. We supported this claim by achieving state-of-the-art performance on Magic’s Hash-Hop long-context benchmark using a simple and novel sparse graph-based RAG algorithm, which runs quickly and efficiently on a single CPU.
This model addresses the two core problems facing RAG systems for long-context tasks: constructing the vector database quickly enough at inference time, and retrieving the indirectly related items required for multi-hop reasoning.
In future work, we plan to test the generality of our retriever on a wider variety of long-context tasks and provide a more detailed description of our algorithm.
An important and open problem in AI is building LLMs that can effectively use long contexts. Long-contexts allow LLMs to learn from large datasets at inference time via the LLM’s in-context learning abilities, rather than during an expensive pre-training phase that tunes parameters.
Building effective long-context LLMs could, for example, allow for LLMs that can summarize, reason over, and answer questions about vast inputs such as code bases or books without the need to alter the model’s parameters.
Current state-of-the-art transformer-based LLMs with long-context (e.g., Google’s Gemini-1.5 at 1M tokens) often perform well on long-context benchmarks, but have the significant drawback of requiring enormous compute, since they use attention layers whose compute cost increases quadratically with context length. This cost limits scalability & deployment flexibility where compute is limited (on-device, cost constrained on-prem etc).
State space models (SSMs) offer a more computationally efficient alternative to transformers, but are known to struggle on certain important categories of task (e.g., recall and in-context learning) relative to transformers. At Zyphra, we have also been addressing this challenge with hybrid SSM-transformer models which can achieve the majority of the speedup of pure SSMs while counteracting their weaknesses with a few carefully selected attention layers. However, SSM-hybrids do still maintain some attention and hence its quadratic cost. In addition, for extremely long context, even the linear forward pass per token of a large model can be computationally prohibitive.
A new alternative model, called the Long-Term-Memory (LTM) model, recently presented by the start-up Magic, is purportedly 1000x cheaper than transformers with naive attention and can process very long contexts on a single H100 GPU. The LTM model was shown to perform well on a new long-context benchmark, “Hash-Hop”, on up to 100 million tokens. Though promising, the LTM model is not open-sourced or accessible and is said to require a full rewrite of the inference and training stack.
At Zyphra, we are exploring using RAG as an effective, efficient, and simple alternative for long-context tasks. In this post, we extend our prior work and present a RAG system that can solve the Hash-Hop task for inputs of at least 1 billion tokens. Our RAG system inputs only a small subset of the long context into the LLM generator making language generation highly compute and memory efficient. Further, we go beyond prior applications of RAG to long-context tasks (e.g., here) by focusing on generating vector databases for retrieval quickly at test time, while simultaneously dealing with complex reasoning tasks, like Hash Hop, through the use of a novel, sparse graph-based retrieval method.
Sparse text embeddings allow us to generate embeddings quickly and run a variant of our previous personalized page-rank graph retriever entirely on the CPU. The result is a RAG system that uses very little memory, is highly compute efficient, and outperforms Magic’s LTM on the Hash-Hop benchmark.
These preliminary results suggest our algorithm is a promising approach for performing long-context tasks, especially in compute constrained scenarios. Because our algorithm is so computationally efficient, it can process effectively unlimited contexts, and we demonstrate strong performance up to 1B context – a 1000x increase over the current leading long-context systems such as Google’s Gemini.
When attempting to apply RAG to long-context tasks, there are two central challenges:
To address these challenges, we propose combining the use of sparse embedding methods with graph-based retrieval; in particular:
We see that, as predicted, the base RAG model fails at Hash-Hop tasks that require more than one ‘hop’ of indirect hash assignments to be retrieved. However, our sparse PPR-RAG model performs close to perfect on the task up to and including 1 billion input tokens, all but matching Magic’s LTM at context lengths up to 16M and outperforming the LTM at lengths greater than 16M. Magic did not test their model on lengths past 100 million tokens, but their results show LTM model performance was trending downward after 16M tokens. It is reasonable to suspect this downward trend would continue to 1 billion, whereas performance of our model remains constant.
These results are significant since Magic’s LTM model was likely trained directly to perform this task and was reported to require re-engineering of the ‘full-stack’ utilized in training just to implement and train the model. Our method, on the other hand, requires no such re-engineering, is compute efficient and requires very little memory, and does not touch the LLM component at all, entirely preserving its language capabilities.
We provide evidence for the effectiveness of our approach on the Hash-Hop task, recently introduced by Magic. The Hash-Hop task requires performing transitive reasoning over synthetic hash assignments of the form ‘Hash1 = Hash2,..., Hash2 = Hash3, ...’, where the actual hashes are random combinations of characters. The model is presented with a document of hundreds of thousands or millions of tokens worth of hashes and a query which gives a starting hash. The model must then respond by listing all hashes that were directly or indirectly assigned to the starting hash. For example, given Hash1 the model must output ‘Hash1=Hash2=Hash3=...’ (see figure 1).
Standard RAG systems will fail at this task (see below), since when presented with the query, e.g., ‘Hash1’ standard retrievers will only retrieve text directly related to Hash1 and miss those hash assignments transitively/indirectly related to Hash1.
However, graph-based retrieval methods that retrieve both directly and indirectly related items should be able to solve this. As shown in the figure above, the relations between hash assignments are naturally represented as a graph, which can be utilized by a graph-based retrieval process, like the one we implement, to retrieve all the relevant hashes, while ignoring those without relation to the starting hash.
We implement and test a standard vector search model that uses sparse embeddings (std-RAG-Sparse) and our sparse personalized-page rank based retriever (PPR-RAG-Sparse). Both RAG systems only retrieve 65 items from memory, which amounts to only several thousand tokens. RAG models use gpt-4o-mini as the language generator to take in the retrieved text chunks and output the desired hash chain. A small few-shot prompt is provided, so the LLM knows what to do. We compare the results to Magic’s long-term memory language model (Magic-LTM), as reported in their blog post.
In the figure above, we compare the computational cost of our RAG system to more standard methods that use dense embeddings generated by BERT-style semantic embedders. We show both the wall clock time it takes for these systems to embed text of various lengths and to retrieve from it.
We test three RAG systems using the following three state-of-the-art dense embedders:
The embedding models are run on a RTX 4090 GPU, and text chunks are batched with sizes optimized for wall clock time, and fed through the models to get the embeddings. We use the standard FAISS library flat indexes for retrieval from the dense vector databases. We compare these models to the sparse PPR retriever and sparse Base retriever used in the last section.
All methods perform retrieval in a small fraction of a second. Importantly, the PPR retriever is fast despite requiring more computations (e.g., numerous matrix multiplies) in its graph retrieval mechanism than the flat vector search of the other methods. Due to the fact that our method can leverage sparse matrix multiplies on the CPU, we can even demonstrate slightly faster speeds than FAISS flat vector search over a dense vector database.
However, BERT embedders embed text chunks much more slowly than the sparse embedding algorithm we used, and the time it takes to embed increases at a faster rate than it does for the sparse methods. The sparse embedder, which uses a symbolic program to extract keyword statistics, is very fast even at longer context lengths, requiring only several seconds to generate embeddings at 16+ million input tokens.
We believe utilizing sparse, graph based RAG models is a promising direction for dealing with long context tasks, especially in compute constrained scenarios such as on-device applications. We supported this claim by achieving state-of-the-art performance on Magic’s Hash-Hop long-context benchmark using a simple and novel sparse graph-based RAG algorithm, which can run fast and efficiently on a single CPU.
This model solved two core problems facing RAG systems for long-context tasks:
In future work, we plan to test the generality of our retriever on a wider variety of long-context tasks and provide a more detailed description of our algorithm.
An important and open problem in AI is building LLMs that can effectively use long contexts. Long-contexts allow LLMs to learn from large datasets at inference time via the LLM’s in-context learning abilities, rather than during an expensive pre-training phase that tunes parameters.
Building effective long-context LLMs could, for example, allow for LLMs that can summarize, reason over, and answer questions about vast inputs such as code bases or books without the need to alter the model’s parameters.
Current state-of-the-art transformer-based LLMs with long-context (e.g., Google’s Gemini-1.5 at 1M tokens) often perform well on long-context benchmarks, but have the significant drawback of requiring enormous compute, since they use attention layers whose compute cost increases quadratically with context length. This cost limits scalability & deployment flexibility where compute is limited (on-device, cost constrained on-prem etc).
State space models (SSMs) offer a more computationally efficient alternative to transformers, but are known to struggle on certain important categories of task (e.g., recall and in-context learning) relative to transformers. At Zyphra, we have also been addressing this challenge with hybrid SSM-transformer models which can achieve the majority of the speedup of pure SSMs while counteracting their weaknesses with a few carefully selected attention layers. However, SSM-hybrids do still maintain some attention and hence its quadratic cost. In addition, for extremely long context, even the linear forward pass per token of a large model can be computationally prohibitive.
A new alternative model, called the Long-Term-Memory (LTM) model, recently presented by the start-up Magic, is purportedly 1000x cheaper than transformers with naive attention and can process very long contexts on a single H100 GPU. The LTM model was shown to perform well on a new long-context benchmark, “Hash-Hop”, on up to 100 million tokens. Though promising, the LTM model is not open-sourced or accessible and is said to require a full rewrite of the inference and training stack.
At Zyphra, we are exploring using RAG as an effective, efficient, and simple alternative for long-context tasks. In this post, we extend our prior work and present a RAG system that can solve the Hash-Hop task for inputs of at least 1 billion tokens. Our RAG system inputs only a small subset of the long context into the LLM generator making language generation highly compute and memory efficient. Further, we go beyond prior applications of RAG to long-context tasks (e.g., here) by focusing on generating vector databases for retrieval quickly at test time, while simultaneously dealing with complex reasoning tasks, like Hash Hop, through the use of a novel, sparse graph-based retrieval method.
Sparse text embeddings allow us to generate embeddings quickly and run a variant of our previous personalized page-rank graph retriever entirely on the CPU. The result is a RAG system that uses very little memory, is highly compute efficient, and outperforms Magic’s LTM on the Hash-Hop benchmark.
These preliminary results suggest our algorithm is a promising approach for performing long-context tasks, especially in compute constrained scenarios. Because our algorithm is so computationally efficient, it can process effectively unlimited contexts, and we demonstrate strong performance up to 1B context – a 1000x increase over the current leading long-context systems such as Google’s Gemini.
When attempting to apply RAG to long-context tasks, there are two central challenges:
To address these challenges, we propose combining the use of sparse embedding methods with graph-based retrieval; in particular:
We see that, as predicted, the base RAG model fails at Hash-Hop tasks that require more than one ‘hop’ of indirect hash assignments to be retrieved. However, our sparse PPR-RAG model performs close to perfect on the task up to and including 1 billion input tokens, all but matching Magic’s LTM at context lengths up to 16M and outperforming the LTM at lengths greater than 16M. Magic did not test their model on lengths past 100 million tokens, but their results show LTM model performance was trending downward after 16M tokens. It is reasonable to suspect this downward trend would continue to 1 billion, whereas performance of our model remains constant.
These results are significant since Magic’s LTM model was likely trained directly to perform this task and was reported to require re-engineering of the ‘full-stack’ utilized in training just to implement and train the model. Our method, on the other hand, requires no such re-engineering, is compute efficient and requires very little memory, and does not touch the LLM component at all, entirely preserving its language capabilities.
We provide evidence for the effectiveness of our approach on the Hash-Hop task, recently introduced by Magic. The Hash-Hop task requires performing transitive reasoning over synthetic hash assignments of the form ‘Hash1 = Hash2,..., Hash2 = Hash3, ...’, where the actual hashes are random combinations of characters. The model is presented with a document of hundreds of thousands or millions of tokens worth of hashes and a query which gives a starting hash. The model must then respond by listing all hashes that were directly or indirectly assigned to the starting hash. For example, given Hash1 the model must output ‘Hash1=Hash2=Hash3=...’ (see figure 1).
Standard RAG systems will fail at this task (see below), since when presented with the query, e.g., ‘Hash1’ standard retrievers will only retrieve text directly related to Hash1 and miss those hash assignments transitively/indirectly related to Hash1.
However, graph-based retrieval methods that retrieve both directly and indirectly related items should be able to solve this. As shown in the figure above, the relations between hash assignments are naturally represented as a graph, which can be utilized by a graph-based retrieval process, like the one we implement, to retrieve all the relevant hashes, while ignoring those without relation to the starting hash.
We implement and test a standard vector search model that uses sparse embeddings (std-RAG-Sparse) and our sparse personalized-page rank based retriever (PPR-RAG-Sparse). Both RAG systems only retrieve 65 items from memory, which amounts to only several thousand tokens. RAG models use gpt-4o-mini as the language generator to take in the retrieved text chunks and output the desired hash chain. A small few-shot prompt is provided, so the LLM knows what to do. We compare the results to Magic’s long-term memory language model (Magic-LTM), as reported in their blog post.
In the figure above, we compare the computational cost of our RAG system to more standard methods that use dense embeddings generated by BERT-style semantic embedders. We show both the wall clock time it takes for these systems to embed text of various lengths and to retrieve from it.
We test three RAG systems using the following three state-of-the-art dense embedders:
The embedding models are run on a RTX 4090 GPU, and text chunks are batched with sizes optimized for wall clock time, and fed through the models to get the embeddings. We use the standard FAISS library flat indexes for retrieval from the dense vector databases. We compare these models to the sparse PPR retriever and sparse Base retriever used in the last section.
All methods perform retrieval in a small fraction of a second. Importantly, the PPR retriever is fast despite requiring more computations (e.g., numerous matrix multiplies) in its graph retrieval mechanism than the flat vector search of the other methods. Due to the fact that our method can leverage sparse matrix multiplies on the CPU, we can even demonstrate slightly faster speeds than FAISS flat vector search over a dense vector database.
However, BERT embedders embed text chunks much more slowly than the sparse embedding algorithm we used, and the time it takes to embed increases at a faster rate than it does for the sparse methods. The sparse embedder, which uses a symbolic program to extract keyword statistics, is very fast even at longer context lengths, requiring only several seconds to generate embeddings at 16+ million input tokens.
An important and open problem in AI is building LLMs that can effectively use long contexts. Long-contexts allow LLMs to learn from large datasets at inference time via the LLM’s in-context learning abilities, rather than during an expensive pre-training phase that tunes parameters.
Building effective long-context LLMs could, for example, allow for LLMs that can summarize, reason over, and answer questions about vast inputs such as code bases or books without the need to alter the model’s parameters.
Current state-of-the-art transformer-based LLMs with long-context (e.g., Google’s Gemini-1.5 at 1M tokens) often perform well on long-context benchmarks, but have the significant drawback of requiring enormous compute, since they use attention layers whose compute cost increases quadratically with context length. This cost limits scalability & deployment flexibility where compute is limited (on-device, cost constrained on-prem etc).
State space models (SSMs) offer a more computationally efficient alternative to transformers, but are known to struggle on certain important categories of task (e.g., recall and in-context learning) relative to transformers. At Zyphra, we have also been addressing this challenge with hybrid SSM-transformer models which can achieve the majority of the speedup of pure SSMs while counteracting their weaknesses with a few carefully selected attention layers. However, SSM-hybrids do still maintain some attention and hence its quadratic cost. In addition, for extremely long context, even the linear forward pass per token of a large model can be computationally prohibitive.
A new alternative model, called the Long-Term-Memory (LTM) model, recently presented by the start-up Magic, is purportedly 1000x cheaper than transformers with naive attention and can process very long contexts on a single H100 GPU. The LTM model was shown to perform well on a new long-context benchmark, “Hash-Hop”, on up to 100 million tokens. Though promising, the LTM model is not open-sourced or accessible and is said to require a full rewrite of the inference and training stack.
At Zyphra, we are exploring using RAG as an effective, efficient, and simple alternative for long-context tasks. In this post, we extend our prior work and present a RAG system that can solve the Hash-Hop task for inputs of at least 1 billion tokens. Our RAG system inputs only a small subset of the long context into the LLM generator making language generation highly compute and memory efficient. Further, we go beyond prior applications of RAG to long-context tasks (e.g., here) by focusing on generating vector databases for retrieval quickly at test time, while simultaneously dealing with complex reasoning tasks, like Hash Hop, through the use of a novel, sparse graph-based retrieval method.
Sparse text embeddings allow us to generate embeddings quickly and run a variant of our previous personalized page-rank graph retriever entirely on the CPU. The result is a RAG system that uses very little memory, is highly compute efficient, and outperforms Magic’s LTM on the Hash-Hop benchmark.
These preliminary results suggest our algorithm is a promising approach for performing long-context tasks, especially in compute constrained scenarios. Because our algorithm is so computationally efficient, it can process effectively unlimited contexts, and we demonstrate strong performance up to 1B context – a 1000x increase over the current leading long-context systems such as Google’s Gemini.
When attempting to apply RAG to long-context tasks, there are two central challenges:
To address these challenges, we propose combining the use of sparse embedding methods with graph-based retrieval; in particular:
We see that, as predicted, the base RAG model fails at Hash-Hop tasks that require more than one ‘hop’ of indirect hash assignments to be retrieved. However, our sparse PPR-RAG model performs close to perfect on the task up to and including 1 billion input tokens, all but matching Magic’s LTM at context lengths up to 16M and outperforming the LTM at lengths greater than 16M. Magic did not test their model on lengths past 100 million tokens, but their results show LTM model performance was trending downward after 16M tokens. It is reasonable to suspect this downward trend would continue to 1 billion, whereas performance of our model remains constant.
These results are significant since Magic’s LTM model was likely trained directly to perform this task and was reported to require re-engineering of the ‘full-stack’ utilized in training just to implement and train the model. Our method, on the other hand, requires no such re-engineering, is compute efficient and requires very little memory, and does not touch the LLM component at all, entirely preserving its language capabilities.
We provide evidence for the effectiveness of our approach on the Hash-Hop task, recently introduced by Magic. The Hash-Hop task requires performing transitive reasoning over synthetic hash assignments of the form ‘Hash1 = Hash2,..., Hash2 = Hash3, ...’, where the actual hashes are random combinations of characters. The model is presented with a document of hundreds of thousands or millions of tokens worth of hashes and a query which gives a starting hash. The model must then respond by listing all hashes that were directly or indirectly assigned to the starting hash. For example, given Hash1 the model must output ‘Hash1=Hash2=Hash3=...’ (see figure 1).
Standard RAG systems will fail at this task (see below), since when presented with the query, e.g., ‘Hash1’ standard retrievers will only retrieve text directly related to Hash1 and miss those hash assignments transitively/indirectly related to Hash1.
However, graph-based retrieval methods that retrieve both directly and indirectly related items should be able to solve this. As shown in the figure above, the relations between hash assignments are naturally represented as a graph, which can be utilized by a graph-based retrieval process, like the one we implement, to retrieve all the relevant hashes, while ignoring those without relation to the starting hash.
We implement and test a standard vector search model that uses sparse embeddings (std-RAG-Sparse) and our sparse personalized-page rank based retriever (PPR-RAG-Sparse). Both RAG systems only retrieve 65 items from memory, which amounts to only several thousand tokens. RAG models use gpt-4o-mini as the language generator to take in the retrieved text chunks and output the desired hash chain. A small few-shot prompt is provided, so the LLM knows what to do. We compare the results to Magic’s long-term memory language model (Magic-LTM), as reported in their blog post.
In the figure above, we compare the computational cost of our RAG system to more standard methods that use dense embeddings generated by BERT-style semantic embedders. We show both the wall clock time it takes for these systems to embed text of various lengths and to retrieve from it.
We test three RAG systems using the following three state-of-the-art dense embedders:
The embedding models are run on a RTX 4090 GPU, and text chunks are batched with sizes optimized for wall clock time, and fed through the models to get the embeddings. We use the standard FAISS library flat indexes for retrieval from the dense vector databases. We compare these models to the sparse PPR retriever and sparse Base retriever used in the last section.
All methods perform retrieval in a small fraction of a second. Importantly, the PPR retriever is fast despite requiring more computations (e.g., numerous matrix multiplies) in its graph retrieval mechanism than the flat vector search of the other methods. Due to the fact that our method can leverage sparse matrix multiplies on the CPU, we can even demonstrate slightly faster speeds than FAISS flat vector search over a dense vector database.
However, BERT embedders embed text chunks much more slowly than the sparse embedding algorithm we used, and the time it takes to embed increases at a faster rate than it does for the sparse methods. The sparse embedder, which uses a symbolic program to extract keyword statistics, is very fast even at longer context lengths, requiring only several seconds to generate embeddings at 16+ million input tokens.
We believe utilizing sparse, graph based RAG models is a promising direction for dealing with long context tasks, especially in compute constrained scenarios such as on-device applications. We supported this claim by achieving state-of-the-art performance on Magic’s Hash-Hop long-context benchmark using a simple and novel sparse graph-based RAG algorithm, which can run fast and efficiently on a single CPU.
This model solved two core problems facing RAG systems for long-context tasks:
In future work, we plan to test the generality of our retriever on a wider variety of long-context tasks and provide a more detailed description of our algorithm.
We present histograms depicting the distribution of cluster sizes in all the datasets (see Fig. 7-11); note that all figures are in log-log scale. We see a significant drop in the number of clusters starting at a cluster size of around 100. This drop is present in both DCLM and FineWeb-Edu2 (see Fig. 8 and 9, respectively), and is most likely explained by a combination of the deduplication strategy and quality filtering used when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low-quality material (repeated advertisements, license-agreement templates, etc.), so it is not surprising that such documents were removed. Notably, DCLM still contained one cluster of close to 1 million documents, consisting of low-quality documents seemingly coming from advertisements (see Appendix).

We find that both Zyda-1 and Dolma-CC contain a small number of duplicates, which is expected, since both datasets were deduplicated globally by their authors. The remaining duplicates are likely false negatives from the initial deduplication procedure. Note that the distributions of duplicate-cluster sizes for these two datasets (Fig. 10 and 11) do not show any sharp drops, but instead decrease faster than exponentially with cluster size.
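For readers who want to reproduce this style of analysis, a log-log cluster-size histogram like those in Fig. 7-11 can be produced as sketched below; `cluster_sizes` is a hypothetical array standing in for the actual duplicate-cluster sizes of a dataset.

```python
# Sketch: histogram of duplicate-cluster sizes on log-log axes.
import numpy as np
import matplotlib.pyplot as plt

cluster_sizes = np.random.zipf(2.0, 50_000)   # placeholder heavy-tailed sizes

bins = np.logspace(0, np.log10(cluster_sizes.max()), 50)
plt.hist(cluster_sizes, bins=bins)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("duplicate-cluster size (documents)")
plt.ylabel("number of clusters")
plt.show()
```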
Below is an example of the document from the largest cluster (~1M documents) of duplicates in DCLM (quality score 0.482627):
Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is SafeSafe score: 1
The higher the number, the more dangerous the website.Any number higher than 1 means DANGER.
Positive votes:
Negative votes:
Vote Up Vote Down review
Have you had bad experience with Warn us, please!
Below one will find a few documents with different quality scores from DCLM coming from the same duplicates cluster. Quality score varies from ~0.2 to ~0.04.