Zyphra Research
Zyphra's research thesis focuses on three pillars to deliver superintelligence.
Press
Zyphra and AMD Partner to Power Zyphra Cloud on AMD Instinct™ MI355X GPUs
Zyphra announced Zyphra Cloud, a full-stack AI platform built on AMD hardware and powered by Tensorwave. The platform launches with Zyphra Inference, a serverless inference service for frontier open-weight models focused on long-horizon agentic workloads.
Research
Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training
Zyphra presents Tensor and Sequence Parallelism (TSP), a novel parallel sharding strategy for training and serving long-context transformer models.
Research
The Zyphra Inference Cloud: AMD-First Inference for Long-Context, Agentic Workloads
Zyphra Cloud is a full-stack AI platform bringing advanced innovations from Zyphra Research into production for developers, enterprises, and frontier AI hyperscalers. Today, we launch the platform with Zyphra Inference, an AMD-first inference service purpose-built for large open models focused on long-context agentic workloads. Zyphra Inference marks the first step toward a unified platform for open, sovereign AI at scale.
Research
Hybrid Associative Memories
Hybrid Associative Memory (HAM) leverages complementary strengths of RNNs and attention. In HAM, the KV cache maintains long-range details by storing only those tokens that are unpredictable by the RNN. HAM shows strong performance relative to the Transformer at a fraction of the cache size.
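As a rough illustration of the gating idea, here is a minimal sketch in which an RNN's prediction error decides which tokens enter the KV cache. The threshold rule, the `rnn` and `embed` modules, and the error metric are illustrative assumptions, not HAM's actual mechanism.

```python
# Sketch only: gate KV-cache admission on RNN prediction error.
# `rnn` (e.g. nn.GRU) and `embed` (nn.Embedding) are assumed to exist.
import torch

def surprise_gated_cache(tokens, rnn, embed, threshold=1.0):
    """Keep only tokens the RNN fails to predict in the KV cache."""
    h = None
    kept = []
    for t in range(len(tokens) - 1):
        x = embed(tokens[t]).unsqueeze(0).unsqueeze(0)  # (1, 1, d)
        out, h = rnn(x, h)                              # one RNN step
        target = embed(tokens[t + 1])
        surprise = torch.norm(out.squeeze() - target)   # prediction error
        if surprise > threshold:                        # unpredictable token:
            kept.append(t + 1)                          # cache its K/V
    return kept  # indices whose keys/values the attention cache retains
```

The design intuition is that predictable tokens are already captured by the RNN state, so only the residual, surprising tokens need exact storage.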
Press
Zyphra Releases ZUNA - BCI Foundation Model Advancing Towards Thought-to-Text
A new brain-computer interface AI model that improves the quality of real-world EEG data, advancing Zyphra's mission to develop human-aligned superintelligence.
Models
ZUNA: BCI Foundation Model Advancing Towards Thought-to-Text
ZUNA is a 380M-parameter BCI foundation model for EEG data, a significant milestone in the development of noninvasive thought-to-text. ZUNA reconstructs, denoises, and upsamples EEG data across arbitrary channel layouts and is built for researchers, clinicians, and BCI developers using real-world data.
Research
Online Vector Quantized Attention
In this blog, we describe a novel sequence mixing layer developed here at Zyphra that aims to find a better compromise between memory-compute costs and long-context capabilities than standard sequence mixing layers. We call this layer Online Vector-Quantized (OVQ) attention.
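To make the compromise concrete, here is a toy sketch of attention over vector-quantized keys, where the cache grows with the codebook size rather than the sequence length. The nearest-centroid assignment and the log-count weighting are assumptions for illustration, not Zyphra's exact OVQ formulation.

```python
# Sketch: keys snapped to a small codebook bound the cache size.
import numpy as np

def vq_attention(q, K, V, codebook):
    # Assign each key to its nearest codebook centroid.
    assign = np.argmin(((K[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
    n_codes = len(codebook)
    counts = np.bincount(assign, minlength=n_codes)
    # Mean value per occupied code.
    v_mean = np.zeros((n_codes, V.shape[1]))
    np.add.at(v_mean, assign, V)
    occupied = counts > 0
    v_mean[occupied] /= counts[occupied, None]
    # Attend over codes; log(count) keeps popular codes proportionally weighted.
    logits = q @ codebook.T + np.where(occupied,
                                       np.log(np.maximum(counts, 1)), -np.inf)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v_mean
```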
Press
Zyphra Demonstrates First Large Scale Training on Integrated AMD Compute and Networking Powered by IBM Cloud
A collaboration between Zyphra, AMD, and IBM delivers ZAYA1, the first large-scale Mixture-of-Experts foundation model trained entirely on an AMD platform using AMD Instinct MI300X GPUs, AMD Pollara networking, and ROCm software.
Models
ZAYA1 – Pretraining on Integrated AMD Platform: Compute, Network, and System Design
Zyphra announces a preview of ZAYA1, the first AI model trained end-to-end entirely on AMD's hardware, software, and networking stack. Details of our pretraining efforts, hardware-specific optimizations, and ZAYA1-base model benchmarks are described in the accompanying technical report published to arXiv.
Research
CCA - Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space
Zyphra shares its research into a novel attention variant: Compressed Convolutional Attention (CCA). CCA dramatically reduces the compute, memory, and parameter costs of self-attention while matching or exceeding the performance of existing methods. Details of CCA and its grouped-query variant CCGQA are described in the accompanying technical report published to arXiv. CCGQA has been subsequently used to train the ZAYA suite of language models.
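As a loose sketch of what attention in a compressed latent space can look like, the module below down-projects, applies a causal depthwise convolution, and attends in the smaller dimension. The projection shapes and convolution placement are guesses for illustration only; the real CCA architecture is detailed in the arXiv report.

```python
# Rough sketch in the spirit of attention in a compressed latent space.
# Not the actual CCA design; causal attention masking omitted for brevity.
import torch, torch.nn as nn

class CompressedAttention(nn.Module):
    def __init__(self, d_model=512, d_latent=128, n_heads=4, kernel=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)       # compress once
        self.conv = nn.Conv1d(d_latent, d_latent, kernel,
                              padding=kernel - 1, groups=d_latent)
        self.attn = nn.MultiheadAttention(d_latent, n_heads, batch_first=True)
        self.up = nn.Linear(d_latent, d_model)         # decompress output

    def forward(self, x):                              # x: (B, T, d_model)
        z = self.down(x)
        # Depthwise causal convolution over the sequence axis.
        z = self.conv(z.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        out, _ = self.attn(z, z, z, need_weights=False)
        return self.up(out)
```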
Models
Introducing ZR1-1.5B, a small but powerful reasoning model for math and code
We introduce ZR1-1.5B, a small reasoning model trained extensively on both coding and mathematics problems with reinforcement learning. ZR1-1.5B outperforms many significantly larger general-purpose non-reasoning models on code generation, while staying close to state-of-the-art small math-only reasoning models on competition-level math evaluations.
On LCB_Generation, ZR1-1.5B achieves parity with Claude3-Opus and Gemma2-27B, while on competition math it outperforms Qwen2.5-72B. Unlike comparable reasoning models, ZR1-1.5B produces significantly shorter reasoning traces, using 60% fewer tokens than R1-Distill-1.5B and 53.5% fewer than DeepScaleR.
Overall, ZR1-1.5B demonstrates strong generalization across disparate domains as well as coherent, efficient reasoning traces compared to models of a similar scale.
Models
Beta Release of Zonos-v0.1
We are excited to announce the release of Zonos-v0.1 beta, featuring two expressive, real-time text-to-speech (TTS) models with high-fidelity voice cloning. We are releasing our 1.6B-parameter transformer and 1.6B-parameter hybrid models under an Apache 2.0 license.
Research
The Mixture-of-PageRanks Retriever for Long-Context Pre-Processing
In this post, we describe our Mixture-of-PageRanks (MixPR) RAG system, built to perform long-context tasks in a highly computationally efficient manner. We describe key features of the algorithm and the SOTA results it achieves across a variety of long-context benchmarks. MixPR can augment any existing foundation model, robustly outperforms frontier long-context models on a variety of benchmarks, and can extend effective LLM context lengths into the billions of tokens while running efficiently on CPU.
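A minimal sketch of the core retrieval primitive, personalized PageRank over a sparse chunk graph, is below. The graph construction and the mixing of PageRank variants that give MixPR its name are simplified away here.

```python
# Sketch: retrieve chunks by personalized PageRank on a sparse graph.
import numpy as np
import scipy.sparse as sp

def pagerank_retrieve(A, query_scores, k=5, alpha=0.85, iters=50):
    """A: sparse chunk adjacency; query_scores: chunk-query affinity."""
    # Column-normalize the adjacency to get a transition matrix.
    deg = np.asarray(A.sum(axis=0)).ravel()
    P = A @ sp.diags(1.0 / np.maximum(deg, 1e-9))
    # Personalization vector biased toward query-relevant chunks.
    v = query_scores / query_scores.sum()
    r = v.copy()
    for _ in range(iters):
        r = alpha * (P @ r) + (1 - alpha) * v
    return np.argsort(-r)[:k]   # top-k chunks to feed the LLM
```

Because the graph is sparse, each iteration is cheap enough to run on CPU even for very large corpora, which is consistent with the efficiency claims above.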
Research
Frontier Training Kernels for Transformers (FA2) and SSMs (Mamba2) on AMD Instinct MI300X Accelerators
In this blog, we demonstrate the first backward-pass kernels to surpass H100 performance for both transformers (Flash Attention v2) and SSMs (Mamba2), enabling foundation model training on AMD Instinct MI300X accelerators.
Research
Reaching 1B context length with RAG
We demonstrate a retrieval system that extends any off-the-shelf LLM to a 1-billion-token context on a standard CPU at inference time. In this post, we share results showing that our approach, a novel retrieval method based on sparse graphs, achieves SoTA performance on the Hash-Hop benchmark, which requires reasoning over elements in an ultra-long context.
Our system excels up to 1 billion tokens of context (and beyond), is more compute- and memory-efficient than common RAG systems that use dense embeddings, and is more efficient than long-context transformer-based LLMs.
These preliminary results suggest our algorithm is a promising approach for long-context tasks, especially in compute-constrained scenarios (on-device, cost-effective on-prem, and cloud deployments).
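For intuition about the task, here is a toy version of a Hash-Hop-style query: the context is a set of hash-to-hash assignments, and answering requires chaining several lookups. The construction below is simplified relative to the actual benchmark.

```python
# Toy Hash-Hop-style task: follow a chain of hash assignments.
import secrets

def make_hashhop(n_pairs=1_000, hops=3):
    hashes = [secrets.token_hex(8) for _ in range(n_pairs + 1)]
    context = {hashes[i]: hashes[i + 1] for i in range(n_pairs)}
    return context, hashes[0], hashes[hops]  # context, query, answer

def resolve(context, start, hops):
    h = start
    for _ in range(hops):       # each hop is one retrieval step
        h = context[h]
    return h

ctx, query, answer = make_hashhop()
assert resolve(ctx, query, 3) == answer
```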
Models
Building Zyda-2, a 5 Trillion Token High-Quality Dataset, with NVIDIA NeMo Curator
Zyphra is excited to release Zyda-2, a 5-trillion-token dataset composed of filtered and cross-deduplicated DCLM, FineWeb-Edu, Zyda-1, and the Common Crawl portion of Dolma v1.7. Leveraging NVIDIA NeMo Curator, we cut data-processing time from three weeks to two days while reducing costs. Zyda-2 powers our Zamba2 series, pushing the boundaries of small-LLM performance and reinforcing Zyphra's position at the forefront of efficient, high-performance language model development.
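As a toy illustration of the cross-deduplication step (the real pipeline uses NeMo Curator with fuzzy, MinHash-style dedup at scale), the sketch below removes exact duplicates across datasets with a priority order; the normalization and ordering choices are illustrative assumptions.

```python
# Toy cross-dataset exact dedup; real pipelines use fuzzy dedup at scale.
import hashlib

def fingerprint(doc: str) -> str:
    norm = " ".join(doc.lower().split())           # cheap normalization
    return hashlib.sha256(norm.encode()).hexdigest()

def cross_dedup(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    seen: set[str] = set()
    out: dict[str, list[str]] = {}
    # Insertion order encodes priority: earlier datasets keep their copies.
    for name, docs in datasets.items():
        kept = []
        for d in docs:
            fp = fingerprint(d)
            if fp not in seen:
                seen.add(fp)
                kept.append(d)
        out[name] = kept
    return out
```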
Models
ZAMBA2-7B
Zyphra is excited to release Zamba2-7B, a state-of-the-art small language model. At the 7B scale, it outperforms the leading models from Mistral, Google's Gemma, and Meta's Llama3 series in both quality and performance.
Models
ZAMBA2-MINI (1.2B)
Zyphra is excited to release Zamba2-mini, a state-of-the-art small language model. Zamba2-mini achieves highly competitive evaluation scores and performance numbers while fitting in a tiny memory footprint of under 700MB at 4-bit quantization: roughly a 7x reduction in parameters for comparable performance, with Zamba2-mini (1.2B) approximately matching Llama2-7B.
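A quick back-of-envelope check of the quoted footprint, assuming 4-bit weights dominate the total:

```python
# 1.2B parameters at 4 bits per weight is ~600MB, leaving headroom
# under 700MB for quantization scales, embeddings, and overhead.
params = 1.2e9
bytes_per_param = 4 / 8                       # 4-bit weights
print(params * bytes_per_param / 1e6, "MB")   # -> 600.0 MB
```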
Research
The Zyphra Training Cookbook
Training hybrid models is hard, and papers tend to gloss over the practical engineering work that goes into building good ones. The purpose of this cookbook is to enable other technical groups to hit the ground running when building their own hybrid (SSM, Transformer, MoE) models.
Research
Understanding Graph-based RAG and Multi-Hop Question Answering
This blog post discusses the relationship between multi-hop question answering and retrieval from graph-based databases. In particular, we develop a mathematical explanation for why graph databases are useful for answering multi-hop questions. We then implement a simple graph database to augment GPT-4o. We test our RAG system on a new needle-in-a-haystack dataset, called BABILong, and find our system is the best-performing model thus far among models not fine-tuned on the dataset.
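To make the multi-hop connection concrete, here is a minimal dictionary-backed fact graph in which a two-hop question becomes a two-step traversal. The schema and example facts are illustrative and not the system from the post.

```python
# Minimal fact graph: entities are nodes, facts are labeled edges,
# and a k-hop question becomes a k-step traversal.
from collections import defaultdict

class FactGraph:
    def __init__(self):
        self.edges = defaultdict(list)      # entity -> [(relation, entity)]

    def add(self, head, relation, tail):
        self.edges[head].append((relation, tail))

    def hop(self, entity, relation):
        return [t for r, t in self.edges[entity] if r == relation]

g = FactGraph()
g.add("Mary", "located_in", "kitchen")
g.add("kitchen", "contains", "apple")
# "What is in the room where Mary is?" resolves as two hops:
rooms = g.hop("Mary", "located_in")
print([item for room in rooms for item in g.hop(room, "contains")])  # ['apple']
```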
Research
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
Zyphra is excited to announce Tree Attention, a novel method for efficiently parallelizing multi-GPU transformer decoding with significant advantages in speed and memory. For instance, we estimate that Tree Attention can decode at a 1M sequence length more than 8x faster than existing Ring Attention while requiring at least 2x less communication volume. Moreover, Tree Attention achieves an asymptotic advantage over Ring Attention in the number of devices, so the benefit grows dramatically for larger clusters.
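The primitive that makes tree reduction possible is that partial attention results combine associatively via log-sum-exp statistics, so shard results can be reduced up a tree rather than passed around a ring. The sketch below shows that combination for a single query; the distributed execution and topology-aware details from the post are omitted.

```python
# Partial attention results (local softmax output + log-sum-exp)
# combine associatively, enabling tree-shaped reduction across devices.
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(256, 64))
V = rng.normal(size=(256, 64))

def partial_attn(q, K, V):
    s = K @ q                       # scores against the local KV shard
    m = s.max()
    w = np.exp(s - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())   # (local output, lse)

def combine(a, b):                  # associative and commutative
    (o1, l1), (o2, l2) = a, b
    m = max(l1, l2)
    w1, w2 = np.exp(l1 - m), np.exp(l2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2), m + np.log(w1 + w2)

# Reduce four "device" shards pairwise up a tree; the result matches
# full softmax attention over all 256 keys.
shards = [partial_attn(q, K[i::4], V[i::4]) for i in range(4)]
while len(shards) > 1:
    shards = [combine(shards[i], shards[i + 1])
              for i in range(0, len(shards), 2)]
full = np.exp(K @ q - (K @ q).max()); full /= full.sum()
assert np.allclose(shards[0][0], full @ V)
```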
Models
Zamba2-Small (2.7B)
Zyphra is excited to release Zamba2-small, a 2.7B state-of-the-art (SOTA) small language model for on-device applications.
Models
Zyda
Zyphra is pleased to announce Zyda, a 1.3-trillion-token open dataset for language modeling. Zyda combines the existing suite of high-quality open datasets and merges them through a uniform and thorough filtering and deduplication process. The goal of Zyda is to provide a simple, accessible, and highly performant dataset for language-modeling experiments and training up to the 1-trillion-token scale. In our ablation studies, Zyda outperforms all existing open datasets, including Dolma, FineWeb, the Pile, RefinedWeb, and SlimPajama.
Research
Toward Conversational Agents with Context and Time Sensitive Long-term Memory
There has recently been growing interest in conversational agents with long-term memory, which has led to the rapid development of language models that use retrieval-augmented generation (RAG). Until recently, most work on RAG focused on information retrieval from large databases of text, like Wikipedia, rather than information from long-form conversations. In this paper, we argue that effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval: 1) time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday), and 2) ambiguous queries that require surrounding conversational context to understand. To better develop RAG-based agents that can deal with these challenges, we generate a new dataset of ambiguous and time-based questions that builds upon a recent dataset of long-form, simulated conversations, and demonstrate that standard RAG-based approaches handle such questions poorly. We then develop a novel retrieval model that combines chain-of-table search methods, standard vector-database retrieval, and a prompting method to disambiguate queries, and demonstrate that this approach substantially improves over current methods at solving these tasks. We believe this new dataset and more advanced RAG agent can act as a key benchmark and stepping stone toward effective memory-augmented conversational agents for a wide variety of AI applications.
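As a minimal illustration of why time/event-based queries differ from similarity search, the sketch below stores sessions with timestamps so that "the third conversation on Tuesday" resolves as a structured lookup rather than a vector search. The schema and session contents are hypothetical, not the paper's implementation.

```python
# Time/event-indexed retrieval: a structured lookup, not similarity search.
from datetime import datetime

sessions = [
    {"start": datetime(2024, 5, 7, 9),  "text": "discussed budget"},
    {"start": datetime(2024, 5, 7, 13), "text": "planned offsite"},
    {"start": datetime(2024, 5, 7, 17), "text": "reviewed hiring"},
]

def nth_conversation_on(weekday: str, n: int):
    day = [s for s in sessions if s["start"].strftime("%A") == weekday]
    return sorted(day, key=lambda s: s["start"])[n - 1]

print(nth_conversation_on("Tuesday", 3)["text"])   # -> "reviewed hiring"
```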
Models
ZAMBA
Zyphra is proud to release Zamba, a novel 7B parameter foundation model.
Research
The Unreasonable Ineffectiveness of the Deeper Layers
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to “heal” the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer-pruning methods can complement other PEFT strategies to further reduce the computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
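A compact sketch of the block-selection step, consistent with the similarity criterion described above: pick the block of n consecutive layers whose input and output hidden states are most similar under mean angular distance on a calibration set. The healing step with QLoRA is omitted, and the metric details are an illustrative choice.

```python
# Sketch: choose the prune block by activation similarity across layers.
import numpy as np

def best_prune_block(hidden, n):
    """hidden: list of (tokens, d) activations after each layer."""
    def angular(a, b):
        cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1)
                                 * np.linalg.norm(b, axis=-1))
        return np.arccos(np.clip(cos, -1, 1)).mean()
    # Distance between representations n layers apart; small distance
    # means the block barely transforms its input and is safe to drop.
    dists = [angular(hidden[l], hidden[l + n])
             for l in range(len(hidden) - n)]
    return int(np.argmin(dists))   # prune layers [l*, l* + n)
```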
Research
NeuraNoC - A neuroscience-inspired packet switch network-on-chip (NoC)
Zyphra’s NeuraNoC is a pioneering packet-switched network-on-chip (NoC), named for its routing mechanism, which resembles the spiking behavior of neurons in the brain by encoding processor connections as Bernoulli processes. It is the first NoC to be trained at compile time to precisely match the bandwidth requirements between connected processors, making it ideal for ML workloads with predictable and sustained bandwidth profiles. Although packet routing in the hardware network may appear stochastic for a given connection, it is in fact deterministic and predefined. This is achieved by treating packets as carriers that may or may not contain a payload, an approach similar to time-domain multiplexing over arbitrary connectivity graphs. The NoC eliminates all the memory blocks typically required in network routers and processing units for packet exchange. As a result, NeuraNoC drastically reduces the silicon footprint, allowing more space for additional compute resources or local memory.
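As a toy picture of the compile-time scheduling idea, the sketch below builds a deterministic, periodic carrier schedule in which each connection owns roughly its bandwidth share of slots, so traffic that looks Bernoulli per connection is in fact predefined. The error-diffusion rule here is an illustrative stand-in for NeuraNoC's actual compile-time training procedure.

```python
# Toy deterministic carrier schedule: connection with share p owns
# roughly a fraction p of slots; empty slots are payload-less carriers.
def build_schedule(shares: dict[str, float], slots: int = 16) -> list[str | None]:
    schedule: list[str | None] = [None] * slots     # None = empty carrier
    err = {c: 0.0 for c in shares}
    for t in range(slots):
        for c in shares:                            # error-diffusion credit
            err[c] += shares[c]
        best = max(err, key=err.get)
        if err[best] >= 1.0:                        # connection earned a slot
            schedule[t] = best
            err[best] -= 1.0
    return schedule

print(build_schedule({"A->B": 0.5, "C->D": 0.25}))
```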