Zyphra Research
Zyphra's research thesis focuses on three pillars to deliver superintelligence.
Press
Zyphra and AMD Partner to Power Zyphra Cloud on AMD Instinct™ MI355X GPUs
Zyphra announced Zyphra Cloud, a full-stack AI platform built on AMD hardware and powered by Tensorwave. The platform launches with Zyphra Inference, a serverless inference service for frontier open-weight models focused on long-horizon agentic workloads.
Research
Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training
Zyphra presents Tensor and Sequence Parallelism (TSP), a novel parallel sharding strategy for training and serving long-context transformer models.
Research
The Zyphra Inference Cloud: AMD-First Inference for Long-Context, Agentic Workloads
Zyphra Cloud is a full-stack AI platform bringing advanced innovations from Zyphra Research into production for developers, enterprises, and frontier AI hyperscalers. Today, we launch the platform with Zyphra Inference, an AMD-first inference service purpose-built for large open models focused on long-context agentic workloads. Zyphra Inference marks the first step toward a unified platform for open, sovereign AI at scale.
Research
Hybrid Associative Memories
Hybrid Associative Memory (HAM) leverages complementary strengths of RNNs and attention. In HAM, the KV cache maintains long-range details by storing only those tokens that are unpredictable by the RNN. HAM shows strong performance relative to the Transformer at a fraction of the cache size.
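As a rough illustration of the gating idea, here is a minimal sketch in which an RNN's prediction error decides which tokens enter the KV cache. The threshold rule, the `rnn` and `embed` modules, and the error metric are illustrative assumptions, not HAM's actual mechanism.

```python
# Sketch only: gate KV-cache admission on RNN prediction error.
# `rnn` (e.g. nn.GRU) and `embed` (nn.Embedding) are assumed to exist.
import torch

def surprise_gated_cache(tokens, rnn, embed, threshold=1.0):
    """Keep only tokens the RNN fails to predict in the KV cache."""
    h = None
    kept = []
    for t in range(len(tokens) - 1):
        x = embed(tokens[t]).unsqueeze(0).unsqueeze(0)  # (1, 1, d)
        out, h = rnn(x, h)                              # one RNN step
        target = embed(tokens[t + 1])
        surprise = torch.norm(out.squeeze() - target)   # prediction error
        if surprise > threshold:                        # unpredictable token:
            kept.append(t + 1)                          # cache its K/V
    return kept  # indices whose keys/values the attention cache retains
```

The design intuition is that predictable tokens are already captured by the RNN state, so only the residual, surprising tokens need exact storage.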
Press
Zyphra Releases ZUNA - BCI Foundation Model Advancing Towards Thought-to-Text
A new brain-computer interface AI model that improves the quality of real-world EEG data, advancing Zyphra's mission to develop human-aligned superintelligence.
Models
ZUNA: BCI Foundation Model Advancing Towards Thought-to-Text
ZUNA is a 380M-parameter BCI foundation model for EEG data, a significant milestone in the development of noninvasive thought-to-text. ZUNA reconstructs, denoises, and upsamples EEG data across arbitrary channel layouts and is built for researchers, clinicians, and BCI developers using real-world data.
Research
Online Vector Quantized Attention
In this blog, we describe a novel sequence mixing layer developed here at Zyphra that aims to find a better compromise between memory-compute costs and long-context capabilities than standard sequence mixing layers. We call this layer Online Vector-Quantized (OVQ) attention.
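To make the compromise concrete, here is a toy sketch of attention over vector-quantized keys, where the cache grows with the codebook size rather than the sequence length. The nearest-centroid assignment and the log-count weighting are assumptions for illustration, not Zyphra's exact OVQ formulation.

```python
# Sketch: keys snapped to a small codebook bound the cache size.
import numpy as np

def vq_attention(q, K, V, codebook):
    # Assign each key to its nearest codebook centroid.
    assign = np.argmin(((K[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
    n_codes = len(codebook)
    counts = np.bincount(assign, minlength=n_codes)
    # Mean value per occupied code.
    v_mean = np.zeros((n_codes, V.shape[1]))
    np.add.at(v_mean, assign, V)
    occupied = counts > 0
    v_mean[occupied] /= counts[occupied, None]
    # Attend over codes; log(count) keeps popular codes proportionally weighted.
    logits = q @ codebook.T + np.where(occupied,
                                       np.log(np.maximum(counts, 1)), -np.inf)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v_mean
```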
Press
Zyphra Demonstrates First Large Scale Training on Integrated AMD Compute and Networking Powered by IBM Cloud
A collaboration between Zyphra, AMD, and IBM delivers ZAYA1, the first large-scale Mixture-of-Experts foundation model trained entirely on an AMD platform using AMD Instinct MI300X GPUs, AMD Pollara networking, and ROCm software.
Models
ZAYA1 – Pretraining on Integrated AMD Platform: Compute, Network, and System Design
Zyphra announces a preview of ZAYA1, the first AI model trained end-to-end entirely on AMD's hardware, software, and networking stack. Details of our pretraining efforts, hardware-specific optimizations, and ZAYA1-base model benchmarks are described in the accompanying technical report published to arXiv.
Research
CCA - Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space
Zyphra shares its research into a novel attention variant: Compressed Convolutional Attention (CCA). CCA dramatically reduces the compute, memory, and parameter costs of self-attention while matching or exceeding the performance of existing methods. Details of CCA and its grouped-query variant CCGQA are described in the accompanying technical report published to arXiv. CCGQA has been subsequently used to train the ZAYA suite of language models.
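As a loose sketch of what attention in a compressed latent space can look like, the module below down-projects, applies a causal depthwise convolution, and attends in the smaller dimension. The projection shapes and convolution placement are guesses for illustration only; the real CCA architecture is detailed in the arXiv report.

```python
# Rough sketch in the spirit of attention in a compressed latent space.
# Not the actual CCA design; causal attention masking omitted for brevity.
import torch, torch.nn as nn

class CompressedAttention(nn.Module):
    def __init__(self, d_model=512, d_latent=128, n_heads=4, kernel=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)       # compress once
        self.conv = nn.Conv1d(d_latent, d_latent, kernel,
                              padding=kernel - 1, groups=d_latent)
        self.attn = nn.MultiheadAttention(d_latent, n_heads, batch_first=True)
        self.up = nn.Linear(d_latent, d_model)         # decompress output

    def forward(self, x):                              # x: (B, T, d_model)
        z = self.down(x)
        # Depthwise causal convolution over the sequence axis.
        z = self.conv(z.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        out, _ = self.attn(z, z, z, need_weights=False)
        return self.up(out)
```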
Models
Introducing ZR1-1.5B, a small but powerful reasoning model for math and code
We introduce ZR1-1.5B, a small reasoning model trained extensively on both coding and mathematics problems with reinforcement learning. ZR1-1.5B outperforms many significantly larger general-purpose non-reasoning models on code generation, while staying close to state-of-the-art small math-only reasoning models on competition-level math evaluations.
On LCB_Generation, ZR1-1.5B achieves parity with Claude3-Opus and Gemma2-27B, while on competition math it outperforms Qwen2.5-72B. Unlike comparable reasoning models, ZR1-1.5B produces significantly shorter reasoning traces, using 60% fewer tokens than R1-Distill-1.5B and 53.5% fewer than DeepScaleR.
Overall, ZR1-1.5B demonstrates strong generalization across disparate domains as well as coherent, efficient reasoning traces compared to models of a similar scale.
Models
Beta Release of Zonos-v0.1
We are excited to announce the release of Zonos-v0.1 beta, featuring two expressive, real-time text-to-speech (TTS) models with high-fidelity voice cloning. We are releasing our 1.6B-parameter transformer and 1.6B-parameter hybrid models under an Apache 2.0 license.
Research
The Mixture-of-PageRanks Retriever for Long-Context Pre-Processing
In this post, we describe our Mixture-of-PageRanks (MixPR) RAG system, built to perform long-context tasks in a highly computationally efficient manner. We describe key features of the algorithm and the SOTA results it achieves across a variety of long-context benchmarks. MixPR can augment any existing foundation model, robustly outperforms frontier long-context models on a variety of benchmarks, and can extend effective LLM context lengths into the billions of tokens while running efficiently on CPU.
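A minimal sketch of the core retrieval primitive, personalized PageRank over a sparse chunk graph, is below. The graph construction and the mixing of PageRank variants that give MixPR its name are simplified away here.

```python
# Sketch: retrieve chunks by personalized PageRank on a sparse graph.
import numpy as np
import scipy.sparse as sp

def pagerank_retrieve(A, query_scores, k=5, alpha=0.85, iters=50):
    """A: sparse chunk adjacency; query_scores: chunk-query affinity."""
    # Column-normalize the adjacency to get a transition matrix.
    deg = np.asarray(A.sum(axis=0)).ravel()
    P = A @ sp.diags(1.0 / np.maximum(deg, 1e-9))
    # Personalization vector biased toward query-relevant chunks.
    v = query_scores / query_scores.sum()
    r = v.copy()
    for _ in range(iters):
        r = alpha * (P @ r) + (1 - alpha) * v
    return np.argsort(-r)[:k]   # top-k chunks to feed the LLM
```

Because the graph is sparse, each iteration is cheap enough to run on CPU even for very large corpora, which is consistent with the efficiency claims above.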
Research
Frontier Training Kernels for Transformers (FA2) and SSMs (Mamba2) on AMD Instinct MI300X Accelerators
In this blog, we demonstrate the first backward-pass kernels to surpass H100 performance for both transformers (Flash Attention v2) and SSMs (Mamba2), enabling foundation model training on AMD Instinct MI300X accelerators.
Research
Reaching 1B context length with RAG
We demonstrate a retrieval system that extends any off-the-shelf LLM to a 1-billion-token context on a standard CPU at inference time. In this post, we share results showing that our approach, a novel retrieval method based on sparse graphs, achieves SoTA performance on the Hash-Hop benchmark, which requires reasoning over elements in an ultra-long context.
Our system excels up to 1 billion tokens of context (and beyond), is more compute- and memory-efficient than common RAG systems that use dense embeddings, and is more efficient than long-context transformer-based LLMs.
These preliminary results suggest our algorithm is a promising approach for long-context tasks, especially in compute-constrained scenarios (on-device, cost-effective on-prem, and cloud deployments).
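For intuition about the task, here is a toy version of a Hash-Hop-style query: the context is a set of hash-to-hash assignments, and answering requires chaining several lookups. The construction below is simplified relative to the actual benchmark.

```python
# Toy Hash-Hop-style task: follow a chain of hash assignments.
import secrets

def make_hashhop(n_pairs=1_000, hops=3):
    hashes = [secrets.token_hex(8) for _ in range(n_pairs + 1)]
    context = {hashes[i]: hashes[i + 1] for i in range(n_pairs)}
    return context, hashes[0], hashes[hops]  # context, query, answer

def resolve(context, start, hops):
    h = start
    for _ in range(hops):       # each hop is one retrieval step
        h = context[h]
    return h

ctx, query, answer = make_hashhop()
assert resolve(ctx, query, 3) == answer
```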
Models
Building Zyda-2, a 5 Trillion Token High-Quality Dataset, with NVIDIA NeMo Curator
Zyphra is excited to release Zyda-2, a 5-trillion-token dataset composed of filtered and cross-deduplicated DCLM, FineWeb-Edu, Zyda-1, and the Common Crawl portion of Dolma v1.7. Leveraging NVIDIA NeMo Curator, we cut data-processing time from three weeks to two days while reducing costs. Zyda-2 powers our Zamba2 series, pushing the boundaries of small-LLM performance and reinforcing Zyphra's position at the forefront of efficient, high-performance language model development.
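As a toy illustration of the cross-deduplication step (the real pipeline uses NeMo Curator with fuzzy, MinHash-style dedup at scale), the sketch below removes exact duplicates across datasets with a priority order; the normalization and ordering choices are illustrative assumptions.

```python
# Toy cross-dataset exact dedup; real pipelines use fuzzy dedup at scale.
import hashlib

def fingerprint(doc: str) -> str:
    norm = " ".join(doc.lower().split())           # cheap normalization
    return hashlib.sha256(norm.encode()).hexdigest()

def cross_dedup(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    seen: set[str] = set()
    out: dict[str, list[str]] = {}
    # Insertion order encodes priority: earlier datasets keep their copies.
    for name, docs in datasets.items():
        kept = []
        for d in docs:
            fp = fingerprint(d)
            if fp not in seen:
                seen.add(fp)
                kept.append(d)
        out[name] = kept
    return out
```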
Models
ZAMBA2-7B
Zyphra is excited to release Zamba2-7B, a state-of-the-art small language model. At the 7B scale, it outperforms the leading models from Mistral, Google's Gemma, and Meta's Llama3 series in both quality and performance.
Models
ZAMBA2-MINI (1.2B)
Zyphra is excited to release Zamba2-mini, a state-of-the-art small language model. Zamba2-mini achieves highly competitive evaluation scores and performance numbers while fitting in a tiny memory footprint of under 700MB at 4-bit quantization: roughly a 7x reduction in parameters for comparable performance, with Zamba2-mini (1.2B) approximately matching Llama2-7B.
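A quick back-of-envelope check of the quoted footprint, assuming 4-bit weights dominate the total:

```python
# 1.2B parameters at 4 bits per weight is ~600MB, leaving headroom
# under 700MB for quantization scales, embeddings, and overhead.
params = 1.2e9
bytes_per_param = 4 / 8                       # 4-bit weights
print(params * bytes_per_param / 1e6, "MB")   # -> 600.0 MB
```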
Research
The Zyphra Training Cookbook
Training hybrid models is hard, and papers tend to gloss over the practical engineering work that goes into building good ones. The purpose of this cookbook is to enable other technical groups to hit the ground running when building their own hybrid (SSM, Transformer, MoE) models.
Research
Understanding Graph-based RAG and Multi-Hop Question Answering
This blog post discusses the relationship between multi-hop question answering and retrieval from graph-based databases. In particular, we develop a mathematical explanation for why graph databases are useful for answering multi-hop questions. We then implement a simple graph database to augment GPT-4o. We test our RAG system on a new needle-in-a-haystack dataset, called BABILong, and find our system is the best-performing model thus far among models not fine-tuned on the dataset.
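To make the multi-hop connection concrete, here is a minimal dictionary-backed fact graph in which a two-hop question becomes a two-step traversal. The schema and example facts are illustrative and not the system from the post.

```python
# Minimal fact graph: entities are nodes, facts are labeled edges,
# and a k-hop question becomes a k-step traversal.
from collections import defaultdict

class FactGraph:
    def __init__(self):
        self.edges = defaultdict(list)      # entity -> [(relation, entity)]

    def add(self, head, relation, tail):
        self.edges[head].append((relation, tail))

    def hop(self, entity, relation):
        return [t for r, t in self.edges[entity] if r == relation]

g = FactGraph()
g.add("Mary", "located_in", "kitchen")
g.add("kitchen", "contains", "apple")
# "What is in the room where Mary is?" resolves as two hops:
rooms = g.hop("Mary", "located_in")
print([item for room in rooms for item in g.hop(room, "contains")])  # ['apple']
```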
Research
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
Zyphra is excited to announce Tree Attention, a novel method for efficiently parallelizing multi-GPU transformer decoding with significant advantages in speed and memory. For instance, we estimate that Tree Attention can decode at a 1M sequence length more than 8x faster than existing Ring Attention while requiring at least 2x less communication volume. Moreover, Tree Attention achieves an asymptotic advantage over Ring Attention in the number of devices, so the benefit grows dramatically for larger clusters.
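The primitive that makes tree reduction possible is that partial attention results combine associatively via log-sum-exp statistics, so shard results can be reduced up a tree rather than passed around a ring. The sketch below shows that combination for a single query; the distributed execution and topology-aware details from the post are omitted.

```python
# Partial attention results (local softmax output + log-sum-exp)
# combine associatively, enabling tree-shaped reduction across devices.
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(256, 64))
V = rng.normal(size=(256, 64))

def partial_attn(q, K, V):
    s = K @ q                       # scores against the local KV shard
    m = s.max()
    w = np.exp(s - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())   # (local output, lse)

def combine(a, b):                  # associative and commutative
    (o1, l1), (o2, l2) = a, b
    m = max(l1, l2)
    w1, w2 = np.exp(l1 - m), np.exp(l2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2), m + np.log(w1 + w2)

# Reduce four "device" shards pairwise up a tree; the result matches
# full softmax attention over all 256 keys.
shards = [partial_attn(q, K[i::4], V[i::4]) for i in range(4)]
while len(shards) > 1:
    shards = [combine(shards[i], shards[i + 1])
              for i in range(0, len(shards), 2)]
full = np.exp(K @ q - (K @ q).max()); full /= full.sum()
assert np.allclose(shards[0][0], full @ V)
```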
Models
Zamba2-Small (2.7B)
Zyphra is excited to release Zamba2-small, a 2.7B state-of-the-art (SOTA) small language model for on-device applications.
Models
Zyda
Zyphra is pleased to announce Zyda, a 1.3-trillion-token open dataset for language modeling. Zyda combines the existing suite of high-quality open datasets and merges them through a uniform and thorough filtering and deduplication process. The goal of Zyda is to provide a simple, accessible, and highly performant dataset for language-modeling experiments and training up to the 1-trillion-token scale. In our ablation studies, Zyda outperforms all existing open datasets, including Dolma, FineWeb, the Pile, RefinedWeb, and SlimPajama.
Research
Toward Conversational Agents with Context and Time Sensitive Long-term Memory
There has recently been growing interest in conversational agents with long-term memory, which has led to the rapid development of language models that use retrieval-augmented generation (RAG). Until recently, most work on RAG focused on information retrieval from large databases of text, like Wikipedia, rather than information from long-form conversations. In this paper, we argue that effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval: 1) time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday), and 2) ambiguous queries that require surrounding conversational context to understand. To better develop RAG-based agents that can deal with these challenges, we generate a new dataset of ambiguous and time-based questions that builds upon a recent dataset of long-form, simulated conversations, and demonstrate that standard RAG-based approaches handle such questions poorly. We then develop a novel retrieval model that combines chain-of-table search methods, standard vector-database retrieval, and a prompting method to disambiguate queries, and demonstrate that this approach substantially improves over current methods at solving these tasks. We believe this new dataset and more advanced RAG agent can act as a key benchmark and stepping stone toward effective memory-augmented conversational agents for a wide variety of AI applications.
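As a minimal illustration of why time/event-based queries differ from similarity search, the sketch below stores sessions with timestamps so that "the third conversation on Tuesday" resolves as a structured lookup rather than a vector search. The schema and session contents are hypothetical, not the paper's implementation.

```python
# Time/event-indexed retrieval: a structured lookup, not similarity search.
from datetime import datetime

sessions = [
    {"start": datetime(2024, 5, 7, 9),  "text": "discussed budget"},
    {"start": datetime(2024, 5, 7, 13), "text": "planned offsite"},
    {"start": datetime(2024, 5, 7, 17), "text": "reviewed hiring"},
]

def nth_conversation_on(weekday: str, n: int):
    day = [s for s in sessions if s["start"].strftime("%A") == weekday]
    return sorted(day, key=lambda s: s["start"])[n - 1]

print(nth_conversation_on("Tuesday", 3)["text"])   # -> "reviewed hiring"
```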
Models
ZAMBA
Zyphra is proud to release Zamba, a novel 7B parameter foundation model.
Research
The Unreasonable Ineffectiveness of the Deeper Layers
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to “heal” the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer-pruning methods can complement other PEFT strategies to further reduce the computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
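A compact sketch of the block-selection step, consistent with the similarity criterion described above: pick the block of n consecutive layers whose input and output hidden states are most similar under mean angular distance on a calibration set. The healing step with QLoRA is omitted, and the metric details are an illustrative choice.

```python
# Sketch: choose the prune block by activation similarity across layers.
import numpy as np

def best_prune_block(hidden, n):
    """hidden: list of (tokens, d) activations after each layer."""
    def angular(a, b):
        cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1)
                                 * np.linalg.norm(b, axis=-1))
        return np.arccos(np.clip(cos, -1, 1)).mean()
    # Distance between representations n layers apart; small distance
    # means the block barely transforms its input and is safe to drop.
    dists = [angular(hidden[l], hidden[l + n])
             for l in range(len(hidden) - n)]
    return int(np.argmin(dists))   # prune layers [l*, l* + n)
```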
Research
NeuraNoC - A neuroscience-inspired packet switch network-on-chip (NoC)
Zyphra’s NeuraNoC is a pioneering packet-switched network-on-chip (NoC), named for its routing mechanism, which resembles the spiking behavior of neurons in the brain by encoding processor connections as Bernoulli processes. It is the first NoC to be trained at compile time to precisely match the bandwidth requirements between connected processors, making it ideal for ML workloads with predictable and sustained bandwidth profiles. Although packet routing in the hardware network may appear stochastic for a given connection, it is in fact deterministic and predefined. This is achieved by treating packets as carriers that may or may not contain a payload, an approach similar to time-domain multiplexing over arbitrary connectivity graphs. The NoC eliminates all the memory blocks typically required in network routers and processing units for packet exchange. As a result, NeuraNoC drastically reduces the silicon footprint, allowing more space for additional compute resources or local memory.
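As a toy picture of the compile-time scheduling idea, the sketch below builds a deterministic, periodic carrier schedule in which each connection owns roughly its bandwidth share of slots, so traffic that looks Bernoulli per connection is in fact predefined. The error-diffusion rule here is an illustrative stand-in for NeuraNoC's actual compile-time training procedure.

```python
# Toy deterministic carrier schedule: connection with share p owns
# roughly a fraction p of slots; empty slots are payload-less carriers.
def build_schedule(shares: dict[str, float], slots: int = 16) -> list[str | None]:
    schedule: list[str | None] = [None] * slots     # None = empty carrier
    err = {c: 0.0 for c in shares}
    for t in range(slots):
        for c in shares:                            # error-diffusion credit
            err[c] += shares[c]
        best = max(err, key=err.get)
        if err[best] >= 1.0:                        # connection earned a slot
            schedule[t] = best
            err[best] -= 1.0
    return schedule

print(build_schedule({"A->B": 0.5, "C->D": 0.25}))
```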