In this blog post, we present the first backward kernels to surpass H100 performance for both transformers (Flash Attention v2) and hybrid models (Mamba2), enabling foundation-model training on AMD Instinct MI300X accelerators.
Zyphra is excited to release Zyda2, a 5-trillion-token dataset composed of filtered and cross-deduplicated DCLM, FineWeb-Edu, Zyda-1, and the Common Crawl portion of Dolma v1.7. Leveraging NVIDIA NeMo Curator, we cut data-processing time from three weeks to two days while also reducing costs.
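For a flavor of what cross-deduplication involves, here is a generic MinHash near-duplicate sketch using the `datasketch` library; it is an illustrative stand-in, not the NeMo Curator pipeline we actually used.

```python
# Generic MinHash near-duplicate detection (illustrative stand-in, not
# the NeMo Curator pipeline): flag documents whose character-shingle
# sets overlap heavily with one already indexed.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:
        m.update(shingle.encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "an entirely different document about datasets",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
for doc_id, text in docs.items():
    sig = minhash(text)
    near_dupes = lsh.query(sig)  # keys of similar docs already indexed
    if near_dupes:
        print(f"dropping {doc_id}: near-duplicate of {near_dupes}")
    else:
        lsh.insert(doc_id, sig)
```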
Zyphra is excited to release Zamba2-7B, a state-of-the-art small language model. At the 7B scale, it outperforms the leading models from Mistral, Google (Gemma), and Meta (Llama3) in both quality and performance. We believe Zamba2-7B is the leading model for running on-device and on consumer GPUs, as well as for many enterprise applications that require a powerful yet compact and efficient model for natural-language tasks.
We demonstrate a retrieval system that extends any off-the-shelf LLM to a 1-billion-token context on a standard CPU at inference time. These preliminary results suggest our algorithm is a promising approach to long-context tasks, especially in compute-constrained scenarios (on-device, cost-effective on-prem and cloud deployments, etc.).
Zyphra’s NeuraNoC is a pioneering packet-switched network-on-chip (NoC), so named because its routing mechanism encodes processor connections as Bernoulli processes, resembling the spiking behavior of neurons in the brain.
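As a toy illustration of that idea (our loose reading, not NeuraNoC's actual router design), each inter-processor link can be modeled as an independent Bernoulli process that emits, or "spikes", a packet on a given cycle with some probability:

```python
import numpy as np

# Toy model: links between n processors as independent Bernoulli
# processes. p[i, j] is the per-cycle probability that node i emits a
# packet toward node j; a draw of 1 is a "spike" on that link.
rng = np.random.default_rng(0)
n_nodes, n_cycles = 4, 8
p = rng.uniform(0.0, 0.3, size=(n_nodes, n_nodes))
np.fill_diagonal(p, 0.0)  # no self-links

for cycle in range(n_cycles):
    spikes = rng.random((n_nodes, n_nodes)) < p  # Bernoulli draws
    for src, dst in zip(*np.nonzero(spikes)):
        print(f"cycle {cycle}: packet {src} -> {dst}")
```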
Zyphra is excited to release Zamba2-mini, a state-of-the-art SLM for on-device applications. Zamba2-mini achieves highly competitive evaluation scores and performance numbers while fitting in a tiny memory footprint of under 700MB at 4-bit quantization; at just 1.2B parameters, it performs comparably to Llama2 7B.
Training hybrid models is hard, and papers tend to gloss over the practical engineering work that goes into building good ones. The purpose of this cookbook is to enable other technical groups to hit the ground running when building their own hybrid (SSM, Transformer, MoE) models.
In this post, we discuss and illustrate the usefulness of graph-based RAG systems for multi-hop Question-Answering (QA) tasks. Multi-hop questions are those that require a chain of multiple retrieval steps to answer.
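For example, "Where was the director of Inception born?" requires one hop to resolve the director and a second to retrieve the birthplace. A minimal sketch of chaining such hops over a toy knowledge graph with `networkx` (the graph, relation labels, and `hop` helper are all hypothetical):

```python
import networkx as nx

# Toy knowledge graph for a two-hop question:
# "Where was the director of Inception born?"
g = nx.DiGraph()
g.add_edge("Inception", "Christopher Nolan", relation="directed_by")
g.add_edge("Christopher Nolan", "London", relation="born_in")

def hop(graph: nx.DiGraph, entity: str, relation: str) -> str:
    """Follow one labeled edge out of `entity` (hypothetical retriever)."""
    for _, dst, data in graph.out_edges(entity, data=True):
        if data["relation"] == relation:
            return dst
    raise KeyError(f"no '{relation}' edge from {entity}")

director = hop(g, "Inception", "directed_by")  # hop 1
birthplace = hop(g, director, "born_in")       # hop 2
print(birthplace)  # London
```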
This blog post discusses the key factors to consider when deploying models on edge devices. We emphasize the significant hardware constraints of these devices and identify techniques for using local hardware resources efficiently: quantization, low-rank adapters, and real-time parameter offloading from storage.
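As a concrete example of the first technique, a 4-bit quantized model load with Hugging Face `transformers` and `bitsandbytes` might look like the sketch below (the model id is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization roughly quarters the weight memory footprint,
# often the difference between fitting on-device or not.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "Zyphra/Zamba2-1.2B"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```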
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed.
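A minimal sketch of such a strategy (our illustration; studies of this kind typically choose the block to drop by measuring inter-layer similarity, then heal the model with light fine-tuning):

```python
import torch.nn as nn

def prune_layers(layers: nn.ModuleList, start: int, n_drop: int) -> nn.ModuleList:
    """Remove a contiguous block of n_drop layers starting at index start."""
    kept = [layer for i, layer in enumerate(layers)
            if not (start <= i < start + n_drop)]
    return nn.ModuleList(kept)

# For a Llama-style model exposing model.model.layers:
# model.model.layers = prune_layers(model.model.layers, start=20, n_drop=8)
# model.config.num_hidden_layers = len(model.model.layers)
```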
Investors need to think outside the box when it comes to addressing artificial intelligence’s energy problem.
Zyphra is excited to announce Tree Attention, a novel method for efficiently parallelizing multi-GPU transformer decoding with significant advantages in speed and memory.
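The property that makes a tree-shaped reduction possible is that partial softmax results combine associatively via log-sum-exp. A single-query NumPy sketch of that merge rule (our simplification, not the multi-GPU kernel):

```python
import numpy as np

# Each key/value chunk yields a partial numerator, its score max, and a
# log-sum-exp; pairs of partials merge with one rule, so the reduction
# over chunks (or GPUs) can be arranged as a tree.
def chunk_partial(q, k, v):
    s = k @ q                                  # scores for this chunk
    m = s.max()
    num = np.exp(s - m) @ v                    # unnormalized weighted values
    return num, m, np.log(np.exp(s - m).sum()) + m

def merge(a, b):
    (na, ma, la), (nb, mb, lb) = a, b
    m = max(ma, mb)
    return na * np.exp(ma - m) + nb * np.exp(mb - m), m, np.logaddexp(la, lb)

rng = np.random.default_rng(0)
q = rng.normal(size=4)
k, v = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))

num, m, lse = merge(chunk_partial(q, k[:4], v[:4]),
                    chunk_partial(q, k[4:], v[4:]))
out_tree = num * np.exp(m - lse)

s = k @ q                                      # direct single-pass reference
out_direct = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ v
assert np.allclose(out_tree, out_direct)
```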
Effective retrieval from long-form conversational data faces two problems that static database retrieval does not.
An Efficient and Fast Small Language Model
A 1.3T-token language modeling dataset that Zyphra claims outperforms the Pile, C4, and arXiv
An LLM training dataset with 1.3T tokens
An SSM-hybrid foundation model to bring AI to more devices
The Startup Tackling Karpathy’s Vision
A Novel Architecture That Combines the Mamba SSM with MoE to Obtain the Benefits of Both
Zyphra is excited to release Zamba2-small, a state-of-the-art 2.7B-parameter small language model for on-device applications.