Zyphra is excited to release Zamba2-mini, a state-of-the-art small language model for on-device applications.
Zamba2-mini achieves highly competitive evaluation scores and inference performance while fitting in a tiny memory footprint of under 700MB at 4-bit quantization.
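As a rough back-of-the-envelope illustration (our own estimate, not an official breakdown), the 4-bit weights alone account for most of that footprint:

```python
# Illustrative memory estimate for 4-bit quantized weights (assumed parameter count,
# not an official figure; runtime state and activations add some overhead on top).
n_params = 1.22e9                 # approximate parameter count of Zamba2-mini
bits_per_weight = 4.0             # 4-bit quantization
weight_bytes = n_params * bits_per_weight / 8
print(f"{weight_bytes / 1e6:.0f} MB of weights")   # ~610 MB, leaving headroom under ~700 MB
```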
[Figure: ~7x fewer parameters for the same performance; Zamba2-mini (1.2B) matches Llama2-7B.]
Zamba2-mini achieves the quality of a 2-3B dense transformer while requiring only the inference compute and memory of a sub-1B dense transformer. Much of our focus in designing hybrid models is on keeping the best of both worlds: the efficiency of SSM/RNN architectures and the quality of the transformer architecture. Several of these design choices contribute to our model's advantages over comparable dense transformers.
Given these results, we believe Zamba2-mini offers a significant improvement over comparable small language models and is especially well suited to on-device environments where memory capacity is constrained and inference speed is paramount.
Zamba2-1.2B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention block in Zamba1, two in Zamba2). These attention layers share weights across their occurrences to minimize the model's parameter cost. We find that concatenating the original input embeddings to the input of the shared attention block improves performance, likely because it helps maintain information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared attention and MLP blocks, adding some expressivity to each block and allowing each shared block to specialize slightly to its unique position in the network while keeping the additional parameter overhead small.
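To make the shared-block idea concrete, here is a minimal, illustrative PyTorch sketch (our simplification, not the actual Zamba2 implementation): a single attention module is reused at several depths, the original embeddings are concatenated to its input, and each call site adds its own small LoRA delta.

```python
# Minimal sketch of a weight-shared attention block with per-call-site LoRA adapters.
# All module names and dimensions here are illustrative, not Zamba2's real layout.
import torch
import torch.nn as nn

class SharedAttentionWithLoRA(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_call_sites: int, lora_rank: int = 8):
        super().__init__()
        # Shared weights: the input projection consumes [hidden ; original embeddings].
        self.in_proj = nn.Linear(2 * d_model, d_model, bias=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        # Per-call-site LoRA factors: cheap, position-specific specialization.
        self.lora_a = nn.ParameterList(
            [nn.Parameter(torch.zeros(2 * d_model, lora_rank)) for _ in range(n_call_sites)]
        )
        self.lora_b = nn.ParameterList(
            [nn.Parameter(torch.randn(lora_rank, d_model) * 0.01) for _ in range(n_call_sites)]
        )

    def forward(self, hidden, embeddings, site: int):
        # Concatenate the original token embeddings to the current hidden state.
        x = torch.cat([hidden, embeddings], dim=-1)
        # Shared projection plus this call site's low-rank delta.
        h = self.in_proj(x) + x @ self.lora_a[site] @ self.lora_b[site]
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return hidden + self.out_proj(attn_out)   # residual back into the Mamba backbone

# Usage: the same module is invoked at multiple depths with different site indices.
d_model, seq = 64, 16
shared = SharedAttentionWithLoRA(d_model, n_heads=4, n_call_sites=3)
emb = torch.randn(2, seq, d_model)
hidden = emb.clone()
for site in range(3):          # stand-in for interleaving with Mamba layers
    hidden = shared(hidden, emb, site)
print(hidden.shape)            # torch.Size([2, 16, 64])
```

The key design point this sketch captures is that the full attention weights are paid for only once, while each occurrence pays only the small LoRA cost to specialize.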
Zamba2-mini also incorporates several architectural improvements over Zamba1-7B.
Our architecture further differs from Zamba2-2.7B in that it applies LoRAs to the shared attention layers and uses only a single shared block rather than that model's alternating scheme.
Zamba2-1.2B was pretrained for approximately 3T tokens on a dataset composed of Zyda and other open-access pre-training datasets (all aggressively filtered and deduplicated to ensure quality), then annealed on 100B of the highest-quality tokens.
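The schedule below is an illustrative sketch of this two-phase structure; the token counts match the description above, but the peak learning rate and decay shape are assumptions, not Zyphra's actual hyperparameters.

```python
# Illustrative two-phase schedule: a long main pretraining phase followed by a short
# "annealing" phase on the highest-quality tokens, with the learning rate decayed rapidly.
import math

def learning_rate(tokens_seen: float,
                  main_tokens: float = 3.0e12,    # ~3T-token main phase
                  anneal_tokens: float = 1.0e11,  # ~100B-token annealing phase
                  peak_lr: float = 3e-4) -> float:  # assumed peak LR, for illustration only
    if tokens_seen < main_tokens:
        return peak_lr                            # flat for simplicity; real runs add warmup/decay
    progress = min((tokens_seen - main_tokens) / anneal_tokens, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to zero

print(f"{learning_rate(2.9e12):.1e}")   # main phase: 3.0e-04
print(f"{learning_rate(3.05e12):.1e}")  # mid-anneal: ~1.5e-04
```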
Zamba2-1.2B will be released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models. A Hugging Face integration is available here, and a pure-PyTorch implementation is available here.
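Once the Hugging Face integration is installed, usage should look roughly like the following; this is a sketch assuming the Zyphra/Zamba2-1.2B checkpoint name and a transformers version with Zamba2 support, so consult the model card for exact instructions.

```python
# Example generation via the Hugging Face integration (checkpoint name and API details
# are assumptions for illustration; see the official model card for the supported path).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B")
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba2-1.2B", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "A hybrid SSM-attention architecture is useful because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```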
Zyphra's team is committed to democratizing advanced AI systems, exploring novel architectures on the frontier of performance, and advancing the scientific study and understanding of powerful models. We look forward to collaborating with others who share our vision.