Zyphra

Back

Model

Oct 14, 2024

Palo Alto, California

ZAMBA2-7B

Zyphra is excited to release Zamba2-7B, a state-of-the-art small language model. At the 7B scale, we outperform the leading models of Mistral, Google’s Gemma and Meta’s Llama3 series in both quality and performance.

Zyphra Team

Read technical report

Hugging Face

GitHub

No headings found on page

Zamba2-7B Highlights

Zamba2-7B achieves SOTA evaluation benchmark performance and superior inference efficiency compared to currently leading 7B models such as Mistral-7B, Gemma-7B, and Llama3-8B
Zamba2-7B is extremely inference-efficient, achieving 25% faster time to first token, a 20% improvement in tokens per second, and a significant reduction in memory usage compared to models such as Llama3-8B.
Architectural improvements over Zamba1-7B:
- Mamba1 blocks have been replaced with Mamba2 blocks
- Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network.
- We apply a LoRA projector to each shared MLP block, which allows the network to specialize the MLPs at each invocation of the shared layer across depth.
We release the model weights open-source (Apache 2.0)

Model Quality

Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.

Our model outperforms existing state-of-the-art models for the following reasons:

Our novel shared-attention architecture allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and openly-available datasets that are aggressively filtered and deduplicated and achieves a state-of-the-art quality in ablations vs the existing top open-source pretraining datasets.
We have a separate "annealing" pre-training phase, which rapidly decays the learning rate over 100B high-quality tokens. Our annealing set is carefully curated for quality and collated from varied high-quality sources.

Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.

Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.

Zamba2-7B Inference Performance

We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.

Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.

Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct

‍Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B

‍Pure PyTorch: https://github.com/Zyphra/Zamba2

Zyphra's team is committed to democratizing advanced AI systems, exploring novel architectures on the frontier of performance, and advancing the scientific study and understanding of powerful models. We look forward to collaborating with others who share our vision.