Reaching 1B context length with RAG
October 21, 2024
PALO ALTO, CALIFORNIA

We demonstrate a retrieval system that extends any off-the-shelf LLM to a 1B (billion)-token context on a standard CPU at inference time. In this post, we share results showing that our approach, a novel retrieval method based on sparse graphs, achieves SoTA performance on the Hash-Hop benchmark, which requires reasoning over elements scattered across an ultra-long context.
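
For readers unfamiliar with the evaluation, the sketch below builds a toy Hash-Hop-style task: the context is a long list of shuffled key → value hash pairs that form chains, and answering a query requires hopping through several links scattered across the context. This is an illustrative reconstruction of the benchmark's structure under our own assumptions, not the official harness; helper names such as `make_hashhop_context` are ours.

```python
import random
import string


def rand_hash(n: int = 12) -> str:
    """Random hex token standing in for a hash."""
    return "".join(random.choices(string.hexdigits.lower(), k=n))


def make_hashhop_context(num_chains: int = 1000, hops: int = 4):
    """Build a context of shuffled 'A -> B' pairs plus one multi-hop query.

    Each chain is h0 -> h1 -> ... -> h_hops; the task is: given h0,
    return h_hops, which requires following several links spread
    randomly through the context rather than a single lookup.
    """
    pairs, queries = [], []
    for _ in range(num_chains):
        chain = [rand_hash() for _ in range(hops + 1)]
        pairs += [f"{a} -> {b}" for a, b in zip(chain, chain[1:])]
        queries.append((chain[0], chain[-1]))
    random.shuffle(pairs)  # links are scattered, not adjacent
    context = "\n".join(pairs)
    start, answer = random.choice(queries)
    return context, start, answer


context, start, answer = make_hashhop_context()
print(len(context.splitlines()), "pairs; query:", start, "answer:", answer)
```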

Our model performs well at context lengths up to 1 billion tokens (and beyond), is more compute- and memory-efficient than common RAG systems that use dense embeddings, and is more efficient than long-context transformer-based LLMs.
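
To give intuition for the memory claim: dense-embedding RAG stores a d-dimensional float vector per chunk, whereas a sparse graph index can store only a handful of integer edges per chunk. The back-of-the-envelope comparison below uses our own illustrative assumptions (chunk count, embedding width, average node degree) and is not the authors' implementation or their reported figures.

```python
# Rough memory estimate: dense-embedding index vs. a sparse graph index.
# All numbers below are illustrative assumptions, not figures from this post.
num_chunks = 10_000_000   # chunks to cover ~1B tokens at ~100 tokens/chunk (assumption)
embed_dim = 1024          # typical dense embedding width (assumption)
bytes_per_float = 4       # float32

dense_bytes = num_chunks * embed_dim * bytes_per_float

avg_degree = 8            # edges per node in a sparse graph (assumption)
bytes_per_edge = 8        # e.g. 4-byte neighbor id + 4-byte weight

sparse_bytes = num_chunks * avg_degree * bytes_per_edge

print(f"dense index : {dense_bytes / 1e9:.1f} GB")   # ~41.0 GB
print(f"sparse graph: {sparse_bytes / 1e9:.1f} GB")  # ~0.6 GB
```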

These preliminary results suggest our algorithm is a promising approach to long-context tasks, especially in compute-constrained scenarios (on-device, cost-effective on-prem and cloud deployments, etc.).

Authors
Nick Alonso, Beren Millidge
Collaborators
Daniel A. Roberts (Sequoia Capital & MIT), Andrey Gromov (Meta FAIR), Kushal Tirumala (Meta FAIR), and Hassan Shapourian (Cisco)