At Zyphra, we are deeply invested in optimizing the user experience for AI. We believe the future of AI will involve a combination of cloud and edge deployment strategies, with an increasing shift towards on-device inference for many use cases. In particular, we have been looking closely at how to improve the experience on edge devices by carefully designing hardware-aware models and by applying personalization techniques. Our Zamba series of models exemplifies our commitment to innovative foundation-model R&D with useful applications on the edge.
This blog post discusses the key factors to consider when deploying models on edge devices. We emphasize the significant hardware constraints of these devices and identify techniques for using local hardware resources efficiently: quantization, low-rank adapters, and real-time parameter offloading from storage.
We explore two case studies, covering memory bandwidth and memory capacity, on Apple's iPhone 15 Pro and NVIDIA's Jetson Orin edge platforms.
Deploying models locally has many benefits. Edge deployments can utilize existing compute resources which are often idle, meaning that inference does not require expensive and in-demand data-center Graphics Processing Units (GPUs). With local deployments, sensitive and personal data is never present on remote servers, enhancing user privacy and enabling regulatory compliance. Furthermore, processing data locally can significantly enhance the user experience by reducing model latency. Finally, model personalization can be made more dynamic by storing different weights on different devices – this enables models to be tailored directly on the device to suit individual user preferences and needs. In contrast, this can be challenging to achieve on cloud servers which must batch queries to maintain cost-efficiency.
Ultimately, powerful local models, capable of performing meaningful linguistic, multimodal, and even intellectual work, will likely be deployed on a large variety of edge devices which are specialized for relevant tasks or personalized for local users. These edge models can also be integrated with larger cloud models, e.g. by routing challenging queries to cloud models, or by offloading compute to the cloud where required. This synthesis of cloud and local compute will maximize efficiency in terms of cost and power and enable the widespread and ubiquitous deployment of AI systems.
However, deploying language models (LMs) on local devices also presents a unique set of challenges compared to data-center deployments. Edge devices have vastly fewer system resources than data centers: memory bandwidth, memory capacity, and processing cores are all comparatively limited, so scaling up models on-device is significantly harder. We aim to address this by optimizing model architecture for parameter and compute efficiency, a process we have begun with the Zamba model series. Zamba models utilize a novel parameter-sharing approach to reduce both the resident set size (RSS) and the number of floating-point operations (FLOPs) required to generate the next token. The consequence of this design is higher FLOP efficiency, which directly translates to lower latency and higher throughput.
To achieve more general capabilities and utility, models must be embedded in a scaffolded system allowing, for instance, database or internet lookups (where available), or tool calling, to target specialized use-cases. Creating a general scaffolding system for on-device models is complex, given the variety and heterogeneity of environments and requirements.
The third challenge lies in the heterogeneous and often complex architectures of edge devices. The figure below shows the typical architecture of an edge SoC (System-on-a-Chip), highlighting specialized accelerators such as the Neural Engine found in the iPhone, along with the available memory-access bandwidths.
Such edge SoCs typically comprise multiple hardware resources, some optimized for ML workloads, such as local GPUs or Neural Processing Units (NPUs). The available resources vary from device to device. Consequently, to maximize hardware utilization, software platforms for serving edge models need to handle both effective multitasking and the splitting of work between different processors, which adds implementation complexity. Additionally, the runtime software ecosystem is fragmented and immature, presenting a large number of compile targets.
Nevertheless, the core constraint of edge devices is typically the same as that of data-center GPUs: memory bandwidth. Memory bandwidth is primarily consumed by loading model parameters from RAM into the device's compute units. In data centers, mini-batching is used to circumvent this memory bound, but edge deployments typically run inference at batch size 1. Thus, optimizing model architectures and inference libraries to reduce memory-bandwidth requirements for single-batch inference is critical.
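As a rough sketch of why batching changes the picture: the model weights are read from memory once per forward pass regardless of how many sequences are in the batch, so the memory-bound token rate grows with the batch size B until compute becomes the limit (ignoring KV-cache and activation traffic):

$$
\text{tokens/s} \;\approx\; \min\!\left(\frac{B \cdot BW_{\text{mem}}}{\text{weight bytes}},\;\; \frac{\text{peak FLOP/s}}{2\,N_{\text{params}}}\right)
$$

At B = 1, the first term dominates on edge hardware, which is why the rest of this post focuses on bandwidth.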
To illustrate some constraints on the edge, let’s consider two practical examples. The first concerns Apple’s iPhone 15 Pro, and the second concerns NVIDIA’s Jetson Orin (64GB).
The iPhone 15 Pro has the following specifications (estimated from [14] and [15]):
- DRAM (LPDDR5) bandwidth: ~51.2 GBps
- DRAM capacity: 8 GB
- NVMe storage (SSD) bandwidth: ~1 GBps
The Jetson Orin (64GB) has the following specifications (estimated from [11][12][13]):
- DRAM (LPDDR5) bandwidth: ~204.8 GBps
- DRAM capacity: 64 GB
Now, let's consider deploying the LLaMA3 8B model on these devices. The model has the following attributes:
- Parameters: ~8 billion
- Weight size with 4-bit quantization: ~4 GB
- Compute per generated token at batch size 1: ~16 GFLOPs
Assuming a compute-bound regime, let's calculate the tokens per second one could achieve on these devices. To keep things simple, we assume the number of FLOPs required per token for LLaMA3 8B is approximately 16 GFLOPs: most operations occur in the model's matrix multiplications, which require roughly one multiply and one add per parameter, and the batch size is 1.
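As a back-of-the-envelope sketch, assume peak low-precision throughput of roughly 35 TOPS for the iPhone 15 Pro's A17 Pro (Neural Engine) and roughly 275 TOPS for the Jetson AGX Orin 64GB; these are publicly quoted peak figures, never fully achievable in practice, and we treat low-precision ops and FLOPs interchangeably here. The compute-bound ceiling would then be:

$$
\text{tokens/s} \approx \frac{\text{peak throughput}}{\text{FLOPs per token}}:\qquad
\frac{35\times10^{12}}{16\times10^{9}} \approx 2{,}200 \ \text{(iPhone)},\qquad
\frac{275\times10^{12}}{16\times10^{9}} \approx 17{,}000 \ \text{(Jetson Orin)}
$$

These rates are far beyond anything observed in practice, a first hint that compute is not the binding constraint.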
Let's now estimate the number of tokens that can be processed per second under a memory-bound scenario, where memory bandwidth is the limiting factor. Given a memory bandwidth of 51.2GBps for the iPhone and 204.8GBps for the Jetson Orin, a 4-bit quantization scheme, and the requirement that the model weights must be read at least once per token, we can calculate the number of tokens per second as follows:
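A minimal version of that calculation, assuming the 8B parameters occupy roughly 4GB at 4 bits per weight and ignoring activation traffic for the moment:

$$
\text{tokens/s} \;\le\; \frac{BW_{\text{mem}}}{\text{bytes read per token}}:\qquad
\frac{51.2\ \text{GBps}}{4\ \text{GB}} \approx 12.8 \ \text{(iPhone)},\qquad
\frac{204.8\ \text{GBps}}{4\ \text{GB}} \approx 51.2 \ \text{(Jetson Orin)}
$$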
Beyond loading the model weights, there is additional overhead from reading and writing activations (including, for transformers, the KV cache), which increases the total data that must be transferred for each token generated. Assuming this traffic is substantial (potentially tens to hundreds of MBs per layer, depending on the architecture and context length), we estimate that the effective data moved per token grows from roughly 4GB to 6GB. Consequently:
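Repeating the estimate with roughly 6GB of traffic per token:

$$
\frac{51.2\ \text{GBps}}{6\ \text{GB}} \approx 8.5 \ \text{tokens/s (iPhone)},\qquad
\frac{204.8\ \text{GBps}}{6\ \text{GB}} \approx 34 \ \text{tokens/s (Jetson Orin)}
$$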
It's evident that, regardless of the compute core selected, memory bandwidth remains the most critical limiting factor on the edge. These theoretical values closely match actual performance measurements, with variations mainly due to the precise size of activations and the context length. To address this, Zyphra is prioritizing techniques that reduce bandwidth utilization, such as quantization, LoRA adapters [1], and approximate matrix multiplication methods such as randomized linear algebra and truncated SVD. Additionally, focusing on SSM models, which have a very compact activation memory footprint, further minimizes the bandwidth used for activations [2, 3].
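To make the low-rank idea concrete, here is a minimal NumPy sketch (illustrative shapes and rank, not Zyphra production code) that factors a weight matrix with a truncated SVD and compares the bytes that must be streamed per token. Note that a random matrix has a flat singular-value spectrum, so the reconstruction error printed below overstates what a real weight matrix with a rapidly decaying spectrum would give:

```python
import numpy as np

def truncated_svd(W: np.ndarray, rank: int):
    """Factor W (d_out x d_in) into A @ B with inner dimension `rank`."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]   # keep only the top-`rank` components

d_out, d_in, rank = 5632, 2048, 256               # illustrative projection shape
W = np.random.randn(d_out, d_in).astype(np.float32)

A, B = truncated_svd(W, rank)                     # W is approximated by A @ B
x = np.random.randn(d_in).astype(np.float32)
y_dense = W @ x
y_lowrank = A @ (B @ x)                           # two skinny matmuls instead of one dense one

# Bytes that must be read from memory per token, assuming fp16 storage (2 bytes/weight).
dense_mb = W.size * 2 / 1e6
lowrank_mb = (A.size + B.size) * 2 / 1e6
print(f"weights streamed per token: dense {dense_mb:.1f} MB vs low-rank {lowrank_mb:.1f} MB")
print(f"relative output error on this random matrix: "
      f"{np.linalg.norm(y_dense - y_lowrank) / np.linalg.norm(y_dense):.2f}")
```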
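A sketch of the offloading bound: suppose a fraction alpha of the (quantized) weights, of total size M bytes, is streamed from the SSD for each token while the remaining fraction is read from DRAM, and suppose the two reads overlap. The SSD does not become the bottleneck only if its read finishes no later than the DRAM read:

$$
\frac{\alpha M}{BW_{\text{SSD}}} \;\le\; \frac{(1-\alpha)\,M}{BW_{\text{DDR}}}
\quad\Longrightarrow\quad
\alpha \;\le\; \frac{BW_{\text{SSD}}}{BW_{\text{SSD}} + BW_{\text{DDR}}}
\;=\; \frac{1}{1 + BW_{\text{DDR}}/BW_{\text{SSD}}}
$$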
This indicates that, to avoid the SSD becoming the bandwidth bottleneck, the offloaded fraction alpha must satisfy this condition. Referring back to the iPhone example, where the DDR-to-SSD bandwidth ratio is about 50, alpha would need to be at most 0.0196. In other words, only about 2% of the model can be streamed from storage while still keeping DRAM bandwidth saturated, which is not very practical. Now consider the same relation, but this time treating the effective DDR bandwidth as the unknown variable. This yields the following result:
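Rearranging for the effective DRAM bandwidth that can actually be kept busy when a fraction alpha of the weights is forced through the SSD (again assuming overlapped reads):

$$
BW_{\text{DDR}}^{\text{eff}} \;=\; \frac{1-\alpha}{\alpha}\; BW_{\text{SSD}}
$$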
With an equal split between SSD and DDR (i.e. half the parameters offloaded; alpha = 0.5), the effective DDR bandwidth that can be kept busy drops to match that of the SSD. This necessarily results in significant slowdowns: as the previous example showed, DDR memory bandwidth (not FLOPs) is the primary bottleneck on performance. Specifically, on the iPhone, where the SSD bandwidth is about 1GBps and the maximum DDR bandwidth is about 50GBps, this configuration reduces effective DDR bandwidth utilization by a factor of 50 (50/1), and the token rate collapses by roughly 25-50x depending on how much the SSD and DDR reads overlap, which is entirely unacceptable.
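A small script under the same assumptions (overlapped reads, 4-bit weights, bandwidth figures from the iPhone example; the numbers are illustrative estimates, not measurements) makes the cliff explicit:

```python
def tokens_per_second(model_gb: float, alpha: float,
                      bw_ddr_gbps: float, bw_ssd_gbps: float) -> float:
    """Memory-bound token rate when a fraction `alpha` of the weights is
    streamed from storage each token and the rest is read from DRAM.
    Assumes the two reads overlap, so the slower one sets the pace."""
    t_ddr = (1 - alpha) * model_gb / bw_ddr_gbps   # seconds per token, DRAM portion
    t_ssd = alpha * model_gb / bw_ssd_gbps         # seconds per token, SSD portion
    return 1.0 / max(t_ddr, t_ssd)

# iPhone-like figures from the example above: 4 GB of 4-bit weights, ~50 GBps DRAM, ~1 GBps SSD.
for alpha in (0.0, 0.02, 0.1, 0.5):
    print(f"alpha={alpha:.2f}: {tokens_per_second(4, alpha, 50, 1):6.1f} tokens/s")
```

Offloading even 10% of the weights already cuts throughput by roughly 5x; at alpha = 0.5 the rate drops to about 0.5 tokens per second.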
It’s evident that maximizing SSD bandwidth is challenging, yet the vast storage capacity it offers is incredibly valuable. This is why we’re focused on advancing research in this area. At Zyphra, our team brings together expertise in chip design and model development, actively pursuing applied research to tackle this intricate and compelling issue at the edge.
Optimizing models for the edge requires technical advances in many areas, which we are pursuing at Zyphra. To improve the performance of a model per unit of memory bandwidth, we are exploring a number of techniques: advanced quantization, matrix compression through low-rank approximations, parameter-sharing techniques, and designing and exploiting unstructured sparsity in model parameters and activations. Another approach is to exploit structured sparsity in the form of sparsely activated models such as Mixture of Experts [7], which deliver strong performance per byte of memory bandwidth required at inference time. To address the high memory footprint of such models, they must be carefully designed so that parameters can be offloaded to cheaper, more abundant storage, such as SSD or disk.
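To illustrate why sparsely activated models are attractive under a bandwidth budget, here is a toy comparison (the parameter counts are illustrative, not those of a specific Zyphra or published model) of the bytes that must be read per token for a dense model versus a mixture-of-experts model that activates only a subset of experts:

```python
def bytes_read_per_token(active_params: float, bits_per_weight: int = 4) -> float:
    """Bytes of weights that must be streamed from memory to generate one token."""
    return active_params * bits_per_weight / 8

dense_active = 8e9                       # dense 8B model: every parameter is read per token
# Toy MoE: a 1B shared trunk plus 8 experts of 3.5B parameters, 2 experts active per token.
moe_total = 1e9 + 8 * 3.5e9
moe_active = 1e9 + 2 * 3.5e9

print(f"dense 8B      : {bytes_read_per_token(dense_active)/1e9:.1f} GB read/token, 8B params total")
print(f"toy MoE (29B) : {bytes_read_per_token(moe_active)/1e9:.1f} GB read/token, "
      f"{moe_total/1e9:.0f}B params total")
```

The toy MoE streams roughly the same number of bytes per token as the dense 8B model while holding several times more parameters; the catch is the much larger total footprint, which is exactly what motivates offloading inactive experts to cheap storage.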
The software ecosystem for deploying models on edge devices is also fragmented. There are tools that compile models for runtime execution, such as IREE and ExecuTorch, which can optimize model execution for different environments, leveraging hardware-specific optimizations to improve performance. There are also different tensor backends whose operations are optimized for different runtimes. Hardware-specific optimization typically increases training and inference throughput [4, 5, 6]. We aim to collaborate with and build upon these and other projects to create general libraries for optimizing model deployment and serving across a wide range of hardware platforms.
To make local models ubiquitous, one of our goals at Zyphra is to build a general serving framework and architecture to run local models and their associated scaffolding efficiently across a wide range of devices. This architecture will require a highly optimized and efficient inference runtime, as well as the ability to run existing model architectures and exploit sparsity efficiently. Such a system must also support task scheduling that integrates the hardware state with pending user requests to optimize the response under the device's constraints. For instance, this system could assess the availability of internet APIs or cloud models for task offloading, detect when there is headroom for on-device finetuning and personalization, and decide when to run background batch tasks versus handling user requests. Additionally, such a system must efficiently dispatch and route work to a variety of hardware cores on an SoC with differing capabilities and strengths. This system will have several components.
Running inference on local devices at low batch sizes, to serve specific use-cases or individual users, enables a degree of model personalization, such as full finetuning, that would be highly inefficient to replicate in the cloud. This is one of the key advantages of local model deployments: the ability to tailor your model to your own needs. To this end, we are exploring a number of approaches that allow the on-device customization of model behavior and parameters.
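As one example of the kind of lightweight on-device customization this enables (a sketch, not a description of a shipped Zyphra system), a LoRA-style update keeps the frozen base weights untouched and trains only a small low-rank delta, so the per-user state that must be stored on the device is tiny:

```python
import numpy as np

d_out, d_in, r = 4096, 4096, 16      # illustrative layer shape and adapter rank

W = np.random.randn(d_out, d_in).astype(np.float32)    # frozen base weights, shared by all users
A = np.zeros((d_out, r), dtype=np.float32)              # adapter factor, zero init so A @ B starts at 0
B = np.random.randn(r, d_in).astype(np.float32) * 0.01  # adapter factor, trained on-device with A

def forward(x: np.ndarray) -> np.ndarray:
    # Effective weight is W + A @ B, but the sum is never materialized:
    # the adapter adds only two skinny matmuls per layer.
    return W @ x + A @ (B @ x)

base_mb = W.size * 2 / 1e6               # fp16 bytes for the shared base layer
adapter_mb = (A.size + B.size) * 2 / 1e6 # fp16 bytes of per-user state
print(f"base layer: {base_mb:.1f} MB shared, adapter: {adapter_mb:.2f} MB per user")
```

Only the adapter needs to be written back to storage after personalization, which also keeps the extra bandwidth cost at inference time small.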
Overall, we believe that local, private, and personalized models offer compelling advantages in many domains and are required to achieve ubiquitous, personal AI. Zyphra aims to deliver highly performant models for local devices through careful co-design of model architecture with hardware constraints to maximize hardware utilization, memory-bandwidth efficiency, and FLOP efficiency. Moreover, Zyphra aims to enable the online, on-device personalization of AI models for specific use-cases and specific users, which we believe will unlock significant utility.
Deploying models locally has many benefits. Edge deployments can utilize existing compute resources which are often idle, meaning that inference does not require expensive and in-demand data-center Graphical Processing Units (GPUs). With local deployments, sensitive and personal data is never present on remote servers, enhancing user privacy and enabling regulatory compliance. Furthermore, processing data locally can significantly enhance the user experience by reducing model latency. Finally, model personalization can be made more dynamic by storing different weights on different devices – this enables models to be tailored directly on the device to suit individual user preferences and needs. In contrast, this can be challenging to achieve on cloud servers which must batch queries to maintain cost-efficiency.
Ultimately, powerful local models, capable of performing meaningful linguistic, multimodal, and even intellectual work, will likely be deployed on a large variety of edge devices which are specialized for relevant tasks or personalized for local users. These edge models can also be integrated with larger cloud models, e.g. by routing challenging queries to cloud models, or by offloading compute to the cloud where required. This synthesis of cloud and local compute will maximize efficiency in terms of cost and power and enable the widespread and ubiquitous deployment of AI systems.
However, deploying language models (LMs) on local devices also presents a unique set of challenges compared to data center deployments. Edge devices have vastly fewer system resources than data centers - increasing the size of models on edge devices is significantly more difficult than in data centers because memory bandwidth/capacity and processing cores are comparatively limited. We aim to address this by optimizing model architecture for parameter and compute efficiency, a process we have begun with the Zamba model series. Zamba models utilize a novel parameter sharing approach to reduce both the resident set size (RSS), and the number of floating-point operations (FLOP) required to generate the next token. The consequence of this design is higher FLOP efficiency, which directly translates to lower latency and higher throughput.
To achieve more general capabilities and utility, models must be embedded in a scaffolded system allowing, for instance, database or internet lookups (where available), or tool calling, to target specialized use-cases. Creating a general scaffolding system for on-device models is complex, given the variety and heterogeneity of environments and requirements.
The third challenge lies in the heterogenous and often complex architectures of edge devices. The figure below shows the typical architecture of an edge SoC (System-on-a-Chip), highlighting specialized accelerators like the Neural Engine found in the iPhone. It also illustrates the available memory access bandwidths.
Such edge SoCs typically comprise multiple hardware resources, some optimized for ML workloads, such as local GPUs or Neural Processing Units (NPUs). Available resources vary depending on the device. Consequently, to maximize hardware utilization, the software platforms for serving edge models need to consider both effective multitasking, and splitting of work between different processors, which adds implementation complexity. Additionally, the runtime software ecosystem is fragmented and immature, presenting a large number of compile targets.
Nevertheless, the core constraint of edge devices is typically the same as data-center GPUs – memory bandwidth. Memory bandwidth is primarily utilized to load model parameters from RAM into the compute units of the device. In data centers, mini-batching is used to circumvent this memory bound. However, edge deployments typically infer one batch at a time (single-batch/batch size = 1 inference). Thus, optimizing model architecture and inference libraries to reduce memory bandwidth requirements for single-batch inference is critical.
To illustrate some constraints on the edge, let’s consider two practical examples. The first concerns Apple’s iPhone 15 Pro, and the second concerns NVIDIA’s Jetson Orin (64GB).
IPhone 15 Pro has the following specifications (estimated from [14] and [15]):
Jetson Orin (estimated from [11][12][13]):
Now, let’s consider deploying the LLaMA3 8B model on this device. The model has the following attributes:
Assume a compute bound regime; let’s calculate the possible tokens per second one could achieve on this device. Just to make things simple we will make the assumption that the number of flops required for LLaMA3 8B is approximately 16 GFLOPS. We assume this because most operations are done in the MLP layers of transformers, which require 1 multiply and 1 add per parameter, and that batch size is equal to 1.
Let's now estimate the number of tokens that can be processed per second under a memory-bound scenario, where memory bandwidth is the limiting factor. Given a memory bandwidth of 51.2GBps for the iPhone and 204.8GBps for Jetson’s Orin, a 4-bit quantization scheme, and the requirement that model weights must be read at least once per token, we can calculate the number of tokens per second as follows:
Beyond loading the model weights, there is additional overhead from reading and writing activations, which increases the total data that must be transferred for each token generated. Assuming the activations per layer are substantial—potentially hundreds of MBs per layer, depending on the architecture—we estimate that the model size expands from 4GB to 6GB due to these activations. Consequently:
It's evident that, regardless of the compute core selected, memory bandwidth remains the most critical limiting factor on the edge. This theoretical value closely matches actual performance measurements, with variations mainly due to the precise size of activations and context length. To address this, Zyphra is prioritizing techniques that reduce bandwidth utilization, such as quantization, the use of LoRA adapters [1], approximate matrix multiplication methods such as randomized linear algebra, and SVD decomposition. Additionally, focusing on SSM models, which have a very compact memory activation footprint, further minimizes bandwidth used for activations [2, 3].
This indicates that to avoid bandwidth limitations, the selected alpha must satisfy this specific condition. Referring back to the iPhone example, where the DDR to SSD bandwidth ratio is about 50, alpha would need to be 0.0196. This means that only 2% of the model can be loaded from storage while still maximizing DRAM bandwidth, which is not very practical. Considering the previous equation, but this time treating the DDR bandwidth as the unknown variable. This would yield the following result:
With an equal distribution between SSD and DDR (i.e. half the parameters offloaded; alpha = 0.5), the DDR bandwidth utilized would need to match that of the SSD. This would result in a significant reduction in the effective DDR bandwidth used. This necessarily results in significant slowdowns: as per the previous example, DDR memory bandwidth (not FLOPs) is the primary bottleneck on performance. Specifically, in the case of the iPhone, where the SSD bandwidth is 1GBps and the maximum DDR bandwidth is 50GBps, this configuration would reduce DDR bandwidth utilization by a factor of 50 (50/1). This reduction would lead to a 50x slowdown in tokens per second, which is entirely unacceptable.
It’s evident that maximizing SSD bandwidth is challenging, yet the vast storage capacity it offers is incredibly valuable. This is why we’re focused on advancing research in this area. At Zyphra, our team brings together expertise in chip design and model development, actively pursuing applied research to tackle this intricate and compelling issue at the edge.
Optimizing models for the edge requires technical advances in many areas, which we are pursuing at Zyphra. To improve the performance per memory bandwidth of a model, we are exploring a number of techniques: advanced quantization, matrix compression through low-rank approximations, parameter-sharing techniques, and designing and exploiting unstructured sparsity in model parameters and activations. Another approach is to exploit structured sparsity in the form of sparsely activated models such as Mixture of Experts [7] which perform extremely strongly per bit of memory bandwidth required in inference. To address the high memory footprint of such models, they must be carefully designed such that parameters can be offloaded to cheaper, more abundant storage, such as SSD or disk.
The software ecosystem for deployment of models on edge devices is also fragmented. There are tools that compile models for runtime execution, such as IREE and ExecuTorch, which can optimize model execution for different environments, leveraging hardware-specific optimizations to improve performance. There are also different tensor backends, whose operations are optimized for different runtimes. Hardware-specificity typically increases training and inference throughput [4, 5, 6]. We aim to collaborate with and build upon these, and other projects to create general libraries for optimizing model deployment and serving across a wide range of hardware platforms.
To make local models ubiquitous, one of our goals at Zyphra is to build a general serving framework and architecture to run local models and their associated scaffolding efficiently across a wide range of devices. This architecture will require both a highly optimized and efficient inference runtime, as well as being capable of inferencing existing model architectures and exploiting sparsity efficiently. Such a system must also support task scheduling to integrate the hardware state with potential user requests to optimize the response given the constraints. For instance, this system could assess availability of access to internet APIs or cloud models for task offloading, detect when there is space available for on-device finetuning and personalization, and decide when to run background batch tasks vs handling user requests. Additionally, such a system must efficiently dispatch and route work to a variety of hardware cores on an SoC with differing capabilities and strengths. This system will have several components:
Local devices running inference at low batch sizes to handle specific use-cases or individual users enables a degree of model personalization, such as full finetuning, that would be highly inefficient to replicate in the cloud. This provides one of the key advantages of local model deployments – the ability to tailor your model to your own needs. To this end, we are exploring a number of approaches that allow the on-device customization of model behavior and parameters.
Overall, we believe that local, private, and personalized models offer compelling advantages in many domains and are required to achieve ubiquitous, personal AI. Zyphra aims to achieve highly performant models for local devices through careful co-design of model architecture with hardware constraints to maximize hardware utilization, memory-bandwidth and flop efficiency. Moreover, Zyphra aims to enable the online and on-device personalization of AI models for specific use-cases and specific users which we believe will unlock significant utility.
Deploying models locally has many benefits. Edge deployments can utilize existing compute resources which are often idle, meaning that inference does not require expensive and in-demand data-center Graphical Processing Units (GPUs). With local deployments, sensitive and personal data is never present on remote servers, enhancing user privacy and enabling regulatory compliance. Furthermore, processing data locally can significantly enhance the user experience by reducing model latency. Finally, model personalization can be made more dynamic by storing different weights on different devices – this enables models to be tailored directly on the device to suit individual user preferences and needs. In contrast, this can be challenging to achieve on cloud servers which must batch queries to maintain cost-efficiency.
Ultimately, powerful local models, capable of performing meaningful linguistic, multimodal, and even intellectual work, will likely be deployed on a large variety of edge devices which are specialized for relevant tasks or personalized for local users. These edge models can also be integrated with larger cloud models, e.g. by routing challenging queries to cloud models, or by offloading compute to the cloud where required. This synthesis of cloud and local compute will maximize efficiency in terms of cost and power and enable the widespread and ubiquitous deployment of AI systems.
However, deploying language models (LMs) on local devices also presents a unique set of challenges compared to data center deployments. Edge devices have vastly fewer system resources than data centers - increasing the size of models on edge devices is significantly more difficult than in data centers because memory bandwidth/capacity and processing cores are comparatively limited. We aim to address this by optimizing model architecture for parameter and compute efficiency, a process we have begun with the Zamba model series. Zamba models utilize a novel parameter sharing approach to reduce both the resident set size (RSS), and the number of floating-point operations (FLOP) required to generate the next token. The consequence of this design is higher FLOP efficiency, which directly translates to lower latency and higher throughput.
To achieve more general capabilities and utility, models must be embedded in a scaffolded system allowing, for instance, database or internet lookups (where available), or tool calling, to target specialized use-cases. Creating a general scaffolding system for on-device models is complex, given the variety and heterogeneity of environments and requirements.
The third challenge lies in the heterogenous and often complex architectures of edge devices. The figure below shows the typical architecture of an edge SoC (System-on-a-Chip), highlighting specialized accelerators like the Neural Engine found in the iPhone. It also illustrates the available memory access bandwidths.
Such edge SoCs typically comprise multiple hardware resources, some optimized for ML workloads, such as local GPUs or Neural Processing Units (NPUs). Available resources vary depending on the device. Consequently, to maximize hardware utilization, the software platforms for serving edge models need to consider both effective multitasking, and splitting of work between different processors, which adds implementation complexity. Additionally, the runtime software ecosystem is fragmented and immature, presenting a large number of compile targets.
Nevertheless, the core constraint of edge devices is typically the same as data-center GPUs – memory bandwidth. Memory bandwidth is primarily utilized to load model parameters from RAM into the compute units of the device. In data centers, mini-batching is used to circumvent this memory bound. However, edge deployments typically infer one batch at a time (single-batch/batch size = 1 inference). Thus, optimizing model architecture and inference libraries to reduce memory bandwidth requirements for single-batch inference is critical.
To illustrate some constraints on the edge, let’s consider two practical examples. The first concerns Apple’s iPhone 15 Pro, and the second concerns NVIDIA’s Jetson Orin (64GB).
IPhone 15 Pro has the following specifications (estimated from [14] and [15]):
Jetson Orin (estimated from [11][12][13]):
Now, let’s consider deploying the LLaMA3 8B model on this device. The model has the following attributes:
Assume a compute bound regime; let’s calculate the possible tokens per second one could achieve on this device. Just to make things simple we will make the assumption that the number of flops required for LLaMA3 8B is approximately 16 GFLOPS. We assume this because most operations are done in the MLP layers of transformers, which require 1 multiply and 1 add per parameter, and that batch size is equal to 1.
Let's now estimate the number of tokens that can be processed per second under a memory-bound scenario, where memory bandwidth is the limiting factor. Given a memory bandwidth of 51.2GBps for the iPhone and 204.8GBps for Jetson’s Orin, a 4-bit quantization scheme, and the requirement that model weights must be read at least once per token, we can calculate the number of tokens per second as follows:
Beyond loading the model weights, there is additional overhead from reading and writing activations, which increases the total data that must be transferred for each token generated. Assuming the activations per layer are substantial—potentially hundreds of MBs per layer, depending on the architecture—we estimate that the model size expands from 4GB to 6GB due to these activations. Consequently:
It's evident that, regardless of the compute core selected, memory bandwidth remains the most critical limiting factor on the edge. This theoretical estimate closely matches actual performance measurements, with variations mainly due to the precise size of the activations and the context length. To address this, Zyphra is prioritizing techniques that reduce bandwidth utilization, such as quantization, LoRA adapters [1], and approximate matrix multiplication methods such as randomized linear algebra and low-rank (SVD) factorization. Additionally, focusing on SSM models, which have a very compact activation memory footprint, further minimizes the bandwidth spent on activations [2, 3].
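As a concrete, self-contained illustration of the low-rank idea (a sketch only, not our production pipeline; the matrix shapes and rank are arbitrary and scaled down), the snippet below replaces a weight matrix with a truncated SVD and compares the bytes that must be streamed to apply it:

```python
import numpy as np

# Illustrative only: approximate a weight matrix W with a rank-r factorization
# so that applying it streams U_r, s_r, and V_r instead of the full matrix.
# A random matrix compresses poorly; real weights set the achievable trade-off.

rng = np.random.default_rng(0)
d_in, d_out, rank = 1024, 4096, 128          # scaled-down, arbitrary shapes

W = rng.standard_normal((d_in, d_out)).astype(np.float32)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

full_bytes = W.nbytes
lowrank_bytes = U[:, :rank].nbytes + s[:rank].nbytes + Vt[:rank, :].nbytes
rel_error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)

print(f"bytes streamed per use: {full_bytes / 1e6:.1f} MB -> {lowrank_bytes / 1e6:.1f} MB")
print(f"relative Frobenius error: {rel_error:.3f}")
```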
A complementary approach is to offload part of the model to storage and stream those parameters in during inference, exploiting the large capacity of on-device flash. Let alpha denote the fraction of the weights read from the SSD for each token, with the remainder read from DRAM, and assume the two reads are overlapped. For the SSD not to become the bottleneck, alpha must satisfy alpha / B_SSD <= (1 - alpha) / B_DDR, i.e. alpha <= B_SSD / (B_SSD + B_DDR). Referring back to the iPhone example, where the DDR-to-SSD bandwidth ratio is about 50, alpha would need to be at most 0.0196. In other words, only about 2% of the model can be served from storage while still making full use of DRAM bandwidth, which is not very practical. We can also take the same relation and instead treat the effective DDR bandwidth as the unknown for a fixed alpha. This yields the following result:
With an equal split between SSD and DDR (i.e. half the parameters offloaded; alpha = 0.5), the DDR would have to be read no faster than the SSD for the two streams to finish together. This drastically reduces the effective DDR bandwidth that can be used, and since DDR memory bandwidth (not FLOPs) is the primary performance bottleneck, it translates directly into a large slowdown. Specifically, for the iPhone, where the SSD bandwidth is about 1 GBps and the peak DDR bandwidth is about 50 GBps, this configuration cuts the utilized DDR bandwidth by a factor of 50 (50/1), and tokens per second drop by a comparable order of magnitude, which is entirely unacceptable.
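The sketch below makes this explicit under the same simplified model (weights read once per token, SSD and DRAM streams fully overlapped); the bandwidth and model-size figures mirror the iPhone numbers used above:

```python
# Simplified model of streaming a fraction `alpha` of the weights from SSD
# while the rest is read from DRAM, with the two reads fully overlapped.
# Figures mirror the iPhone example in the text.

MODEL_GB = 4.0   # 8B parameters at 4-bit quantization
B_DDR = 50.0     # GB/s, approximate DRAM bandwidth
B_SSD = 1.0      # GB/s, approximate SSD bandwidth

# Largest fraction that can live on SSD before the SSD becomes the bottleneck.
alpha_max = B_SSD / (B_SSD + B_DDR)
print(f"alpha_max = {alpha_max:.4f} (~{alpha_max:.1%} of the parameters)")

for alpha in (0.0, alpha_max, 0.5):
    t_ssd = alpha * MODEL_GB / B_SSD          # seconds to stream the SSD share
    t_ddr = (1 - alpha) * MODEL_GB / B_DDR    # seconds to stream the DRAM share
    t_token = max(t_ssd, t_ddr)               # streams overlap; the slower one dominates
    ddr_utilized = (1 - alpha) * MODEL_GB / t_token   # GB/s actually drawn from DRAM
    print(f"alpha = {alpha:.3f}: ~{1.0 / t_token:5.2f} tok/s, "
          f"DDR utilized ~{ddr_utilized:4.1f} GB/s")
```

At alpha = 0.5, DRAM is effectively drawn at only ~1 GB/s, the factor-of-50 reduction discussed above, and decode throughput collapses with it.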
It’s evident that maximizing SSD bandwidth is challenging, yet the vast storage capacity it offers is incredibly valuable. This is why we’re focused on advancing research in this area. At Zyphra, our team brings together expertise in chip design and model development, actively pursuing applied research to tackle this intricate and compelling issue at the edge.
Optimizing models for the edge requires technical advances in many areas, which we are pursuing at Zyphra. To improve the performance per unit of memory bandwidth, we are exploring a number of techniques: advanced quantization, matrix compression through low-rank approximations, parameter-sharing techniques, and designing and exploiting unstructured sparsity in model parameters and activations. Another approach is to exploit structured sparsity in the form of sparsely activated models such as Mixture of Experts [7], which deliver very strong performance per byte of memory bandwidth moved at inference. To address the high memory footprint of such models, they must be carefully designed so that parameters can be offloaded to cheaper, more abundant storage, such as SSD or disk.
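As a sketch of what designing for offload can look like (purely illustrative; the file layout, shapes, and expert count are hypothetical, and real systems such as [8, 9] add caching and prefetching on top), expert weights can be stored contiguously on disk and memory-mapped, so that only the experts selected by the router are actually read for a given token:

```python
import numpy as np

# Illustrative sketch: keep MoE expert weights on disk and memory-map them,
# so only the experts the router selects are read from storage per token.
# File name, shapes, and expert count are hypothetical.

N_EXPERTS, D_MODEL, D_FF = 8, 256, 1024
PATH = "experts.f16.bin"

# One-time setup: write dummy expert weights (a stand-in for a real checkpoint).
rng = np.random.default_rng(0)
rng.standard_normal((N_EXPERTS, D_MODEL, D_FF)).astype(np.float16).tofile(PATH)

# At inference time: memory-map the file; pages are pulled from storage on access.
experts = np.memmap(PATH, dtype=np.float16, mode="r",
                    shape=(N_EXPERTS, D_MODEL, D_FF))

def expert_forward(x: np.ndarray, expert_id: int) -> np.ndarray:
    """Applies one expert's matrix, reading only its weights via the mmap."""
    w = np.asarray(experts[expert_id], dtype=np.float32)  # triggers the actual read
    return x @ w

x = rng.standard_normal(D_MODEL).astype(np.float32)
selected = 3          # in a real MoE this index comes from the router's top-k
print(expert_forward(x, selected).shape)   # (1024,)
```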
The software ecosystem for deployment of models on edge devices is also fragmented. There are tools that compile models for runtime execution, such as IREE and ExecuTorch, which can optimize model execution for different environments, leveraging hardware-specific optimizations to improve performance. There are also different tensor backends, whose operations are optimized for different runtimes. Hardware-specificity typically increases training and inference throughput [4, 5, 6]. We aim to collaborate with and build upon these, and other projects to create general libraries for optimizing model deployment and serving across a wide range of hardware platforms.
To make local models ubiquitous, one of our goals at Zyphra is to build a general serving framework and architecture to run local models and their associated scaffolding efficiently across a wide range of devices. This architecture requires a highly optimized and efficient inference runtime that can serve existing model architectures and exploit sparsity efficiently. Such a system must also support task scheduling that combines the hardware state with incoming user requests to optimize the response under the device's constraints. For instance, it could assess whether internet APIs or cloud models are reachable for task offloading, detect when there is headroom for on-device finetuning and personalization, and decide when to run background batch tasks versus handling user requests. Additionally, it must efficiently dispatch and route work across the variety of hardware cores on an SoC with differing capabilities and strengths. Such a system will comprise several components, spanning the inference runtime, the scheduler, and the hardware dispatch layer.
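A toy sketch of the kind of scheduling decision described above (entirely illustrative; the fields, names, and policy are hypothetical and not a description of our framework):

```python
from dataclasses import dataclass

# Toy illustration of an edge scheduling decision; fields and policy are
# hypothetical, chosen only to mirror the considerations in the text.

@dataclass
class DeviceState:
    cloud_reachable: bool      # can we offload to a cloud model / internet API?
    battery_fraction: float    # 0.0 - 1.0
    accelerator_idle: bool     # is the NPU/GPU free right now?

@dataclass
class Task:
    kind: str                  # "user_query", "finetune", "batch_background"
    hard: bool = False         # heuristically flagged as beyond the local model

def schedule(task: Task, state: DeviceState) -> str:
    if task.kind == "user_query":
        if task.hard and state.cloud_reachable:
            return "route_to_cloud"
        return "run_on_npu" if state.accelerator_idle else "run_on_cpu"
    # Personalization and batch work only run when the device is otherwise idle.
    if state.accelerator_idle and state.battery_fraction > 0.5:
        return "run_locally_in_background"
    return "defer"

print(schedule(Task("user_query", hard=True),
               DeviceState(cloud_reachable=True, battery_fraction=0.8,
                           accelerator_idle=False)))   # -> route_to_cloud
```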
Running inference on local devices at low batch sizes, for specific use-cases or individual users, enables a degree of model personalization, such as full finetuning, that would be highly inefficient to replicate in the cloud. This is one of the key advantages of local model deployments: the ability to tailor your model to your own needs. To this end, we are exploring a number of approaches that allow on-device customization of model behavior and parameters.
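One approach that is easy to sketch, in the spirit of LoRA/QLoRA [1], is to confine personalization updates to a pair of small low-rank matrices per layer, so on-device finetuning trains and stores only a tiny fraction of the parameters while the (possibly quantized) base weights stay frozen. A minimal illustration, with arbitrary shapes and rank:

```python
import numpy as np

# Minimal LoRA-style adapter sketch: the base weight W stays frozen (and can
# remain quantized); only the small A and B matrices are trained per user.
# Shapes and rank are arbitrary, chosen for illustration.

rng = np.random.default_rng(0)
d_in, d_out, rank = 1024, 1024, 8

W = rng.standard_normal((d_in, d_out)).astype(np.float32)        # frozen base weight
A = 0.01 * rng.standard_normal((d_in, rank)).astype(np.float32)  # trainable
B = np.zeros((rank, d_out), dtype=np.float32)                    # trainable, init 0

def adapted_forward(x: np.ndarray) -> np.ndarray:
    # y = x W + x A B ; the adapter starts as a no-op because B is zero.
    return x @ W + (x @ A) @ B

x = rng.standard_normal((4, d_in)).astype(np.float32)
print(adapted_forward(x).shape)                       # (4, 1024)

trainable_fraction = (A.size + B.size) / W.size
print(f"trainable fraction: {trainable_fraction:.2%}")  # ~1.56% of the layer
```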
Overall, we believe that local, private, and personalized models offer compelling advantages in many domains and are required to achieve ubiquitous, personal AI. Zyphra aims to deliver highly performant models for local devices through careful co-design of the model architecture with hardware constraints to maximize hardware utilization, memory-bandwidth efficiency, and FLOP efficiency. Moreover, Zyphra aims to enable the online, on-device personalization of AI models for specific use-cases and specific users, which we believe will unlock significant utility.
[1] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, “QLORA: Efficient Finetuning of Quantized LLMs,” arXiv preprint arXiv:2305.14314v1 [cs.LG], May 2023 (https://arxiv.org/pdf/2305.14314)
[2] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge, “Zamba: A Compact 7B SSM Hybrid Model,” arXiv preprint arXiv:2405.16712v1 [cs.LG], May 2024 (https://arxiv.org/pdf/2405.16712)
[3] Opher Lieber, Barak Lenz, Hofit Bata, et al., “Jamba: A Hybrid Transformer-Mamba Language Model,” arXiv preprint arXiv:2403.19887v2 [cs.CL], Jul 2024 (https://arxiv.org/pdf/2403.19887)
[4] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” arXiv:2205.14135v2 [cs.LG], Jun 2022 (https://arxiv.org/pdf/2205.14135).
[5] Tri Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” arXiv:2307.08691v1 [cs.LG], Jul 2023 (https://arxiv.org/pdf/2307.08691)
[6] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao, “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision,” arXiv:2407.08608v2 [cs.LG], Jul 2024 (https://arxiv.org/pdf/2407.08608)
[7] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang, “A Survey on Mixture of Experts,” arXiv preprint arXiv:2407.06204v2 [cs.LG], Aug 2024 (https://arxiv.org/pdf/2407.06204)
[8] Artom Eliseev and Denis Mazur, “Fast Inference of Mixture-of-Experts Language Models with Offloading,” arXiv preprint arXiv:2312.17238v1 [cs.LG], Dec 2023 (https://arxiv.org/pdf/2312.17238)
[9] Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina, “MoE-Infinity: Offloading-Efficient MoE Model Serving”, arXiv preprint arXiv:2401.14361, Jan 2024
[10] Tim Dettmers, Luke Zettlemoyer, “The case for 4-bit precision: k-bit Inference Scaling Laws”, arXiv preprint arXiv:2212.09720 [cs.LG], Dec 2022
[11] https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
[13] https://developer.arm.com/Processors/Cortex-A78AE
[14] https://nanoreview.net/en/soc-compare/apple-a17-pro-vs-apple-a15-bionic
Deploying models locally has many benefits. Edge deployments can utilize existing compute resources which are often idle, meaning that inference does not require expensive and in-demand data-center Graphical Processing Units (GPUs). With local deployments, sensitive and personal data is never present on remote servers, enhancing user privacy and enabling regulatory compliance. Furthermore, processing data locally can significantly enhance the user experience by reducing model latency. Finally, model personalization can be made more dynamic by storing different weights on different devices – this enables models to be tailored directly on the device to suit individual user preferences and needs. In contrast, this can be challenging to achieve on cloud servers which must batch queries to maintain cost-efficiency.
Ultimately, powerful local models, capable of performing meaningful linguistic, multimodal, and even intellectual work, will likely be deployed on a large variety of edge devices which are specialized for relevant tasks or personalized for local users. These edge models can also be integrated with larger cloud models, e.g. by routing challenging queries to cloud models, or by offloading compute to the cloud where required. This synthesis of cloud and local compute will maximize efficiency in terms of cost and power and enable the widespread and ubiquitous deployment of AI systems.
However, deploying language models (LMs) on local devices also presents a unique set of challenges compared to data center deployments. Edge devices have vastly fewer system resources than data centers - increasing the size of models on edge devices is significantly more difficult than in data centers because memory bandwidth/capacity and processing cores are comparatively limited. We aim to address this by optimizing model architecture for parameter and compute efficiency, a process we have begun with the Zamba model series. Zamba models utilize a novel parameter sharing approach to reduce both the resident set size (RSS), and the number of floating-point operations (FLOP) required to generate the next token. The consequence of this design is higher FLOP efficiency, which directly translates to lower latency and higher throughput.
To achieve more general capabilities and utility, models must be embedded in a scaffolded system allowing, for instance, database or internet lookups (where available), or tool calling, to target specialized use-cases. Creating a general scaffolding system for on-device models is complex, given the variety and heterogeneity of environments and requirements.
The third challenge lies in the heterogenous and often complex architectures of edge devices. The figure below shows the typical architecture of an edge SoC (System-on-a-Chip), highlighting specialized accelerators like the Neural Engine found in the iPhone. It also illustrates the available memory access bandwidths.
Such edge SoCs typically comprise multiple hardware resources, some optimized for ML workloads, such as local GPUs or Neural Processing Units (NPUs). Available resources vary depending on the device. Consequently, to maximize hardware utilization, the software platforms for serving edge models need to consider both effective multitasking, and splitting of work between different processors, which adds implementation complexity. Additionally, the runtime software ecosystem is fragmented and immature, presenting a large number of compile targets.
Nevertheless, the core constraint of edge devices is typically the same as data-center GPUs – memory bandwidth. Memory bandwidth is primarily utilized to load model parameters from RAM into the compute units of the device. In data centers, mini-batching is used to circumvent this memory bound. However, edge deployments typically infer one batch at a time (single-batch/batch size = 1 inference). Thus, optimizing model architecture and inference libraries to reduce memory bandwidth requirements for single-batch inference is critical.
To illustrate some constraints on the edge, let’s consider two practical examples. The first concerns Apple’s iPhone 15 Pro, and the second concerns NVIDIA’s Jetson Orin (64GB).
IPhone 15 Pro has the following specifications (estimated from [14] and [15]):
Jetson Orin (estimated from [11][12][13]):
Now, let’s consider deploying the LLaMA3 8B model on this device. The model has the following attributes:
Assume a compute bound regime; let’s calculate the possible tokens per second one could achieve on this device. Just to make things simple we will make the assumption that the number of flops required for LLaMA3 8B is approximately 16 GFLOPS. We assume this because most operations are done in the MLP layers of transformers, which require 1 multiply and 1 add per parameter, and that batch size is equal to 1.
Let's now estimate the number of tokens that can be processed per second under a memory-bound scenario, where memory bandwidth is the limiting factor. Given a memory bandwidth of 51.2GBps for the iPhone and 204.8GBps for Jetson’s Orin, a 4-bit quantization scheme, and the requirement that model weights must be read at least once per token, we can calculate the number of tokens per second as follows:
Beyond loading the model weights, there is additional overhead from reading and writing activations, which increases the total data that must be transferred for each token generated. Assuming the activations per layer are substantial—potentially hundreds of MBs per layer, depending on the architecture—we estimate that the model size expands from 4GB to 6GB due to these activations. Consequently:
It's evident that, regardless of the compute core selected, memory bandwidth remains the most critical limiting factor on the edge. This theoretical value closely matches actual performance measurements, with variations mainly due to the precise size of activations and context length. To address this, Zyphra is prioritizing techniques that reduce bandwidth utilization, such as quantization, the use of LoRA adapters [1], approximate matrix multiplication methods such as randomized linear algebra, and SVD decomposition. Additionally, focusing on SSM models, which have a very compact memory activation footprint, further minimizes bandwidth used for activations [2, 3].
This indicates that to avoid bandwidth limitations, the selected alpha must satisfy this specific condition. Referring back to the iPhone example, where the DDR to SSD bandwidth ratio is about 50, alpha would need to be 0.0196. This means that only 2% of the model can be loaded from storage while still maximizing DRAM bandwidth, which is not very practical. Considering the previous equation, but this time treating the DDR bandwidth as the unknown variable. This would yield the following result:
With an equal distribution between SSD and DDR (i.e. half the parameters offloaded; alpha = 0.5), the DDR bandwidth utilized would need to match that of the SSD. This would result in a significant reduction in the effective DDR bandwidth used. This necessarily results in significant slowdowns: as per the previous example, DDR memory bandwidth (not FLOPs) is the primary bottleneck on performance. Specifically, in the case of the iPhone, where the SSD bandwidth is 1GBps and the maximum DDR bandwidth is 50GBps, this configuration would reduce DDR bandwidth utilization by a factor of 50 (50/1). This reduction would lead to a 50x slowdown in tokens per second, which is entirely unacceptable.
It’s evident that maximizing SSD bandwidth is challenging, yet the vast storage capacity it offers is incredibly valuable. This is why we’re focused on advancing research in this area. At Zyphra, our team brings together expertise in chip design and model development, actively pursuing applied research to tackle this intricate and compelling issue at the edge.
Optimizing models for the edge requires technical advances in many areas, which we are pursuing at Zyphra. To improve the performance per memory bandwidth of a model, we are exploring a number of techniques: advanced quantization, matrix compression through low-rank approximations, parameter-sharing techniques, and designing and exploiting unstructured sparsity in model parameters and activations. Another approach is to exploit structured sparsity in the form of sparsely activated models such as Mixture of Experts [7] which perform extremely strongly per bit of memory bandwidth required in inference. To address the high memory footprint of such models, they must be carefully designed such that parameters can be offloaded to cheaper, more abundant storage, such as SSD or disk.
The software ecosystem for deployment of models on edge devices is also fragmented. There are tools that compile models for runtime execution, such as IREE and ExecuTorch, which can optimize model execution for different environments, leveraging hardware-specific optimizations to improve performance. There are also different tensor backends, whose operations are optimized for different runtimes. Hardware-specificity typically increases training and inference throughput [4, 5, 6]. We aim to collaborate with and build upon these, and other projects to create general libraries for optimizing model deployment and serving across a wide range of hardware platforms.
To make local models ubiquitous, one of our goals at Zyphra is to build a general serving framework and architecture to run local models and their associated scaffolding efficiently across a wide range of devices. This architecture will require both a highly optimized and efficient inference runtime, as well as being capable of inferencing existing model architectures and exploiting sparsity efficiently. Such a system must also support task scheduling to integrate the hardware state with potential user requests to optimize the response given the constraints. For instance, this system could assess availability of access to internet APIs or cloud models for task offloading, detect when there is space available for on-device finetuning and personalization, and decide when to run background batch tasks vs handling user requests. Additionally, such a system must efficiently dispatch and route work to a variety of hardware cores on an SoC with differing capabilities and strengths. This system will have several components:
Local devices running inference at low batch sizes to handle specific use-cases or individual users enables a degree of model personalization, such as full finetuning, that would be highly inefficient to replicate in the cloud. This provides one of the key advantages of local model deployments – the ability to tailor your model to your own needs. To this end, we are exploring a number of approaches that allow the on-device customization of model behavior and parameters.
Overall, we believe that local, private, and personalized models offer compelling advantages in many domains and are required to achieve ubiquitous, personal AI. Zyphra aims to achieve highly performant models for local devices through careful co-design of model architecture with hardware constraints to maximize hardware utilization, memory-bandwidth and flop efficiency. Moreover, Zyphra aims to enable the online and on-device personalization of AI models for specific use-cases and specific users which we believe will unlock significant utility.
[1] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, “QLORA: Efficient Finetuning of Quantized LLMs,” arXiv preprint arXiv:2305.14314v1 [cs.LG], May 2023 (https://arxiv.org/pdf/2305.14314)
[2] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge, “Zamba: A Compact 7B SSM Hybrid Model,” arXiv preprint arXiv:2405.16712v1 [cs.LG], May 2024 (https://arxiv.org/pdf/2405.16712)
[3] Opher Lieber, Barak Lenz, Hofit Bata, et al, “Jamba: A Hybrid Transformer-Mamba Language Model,” arXiv preprint arXiv:2403.19887v2 [cs.CL], Jul 2024 (https://arxiv.org/pdf/2403.19887)
[4] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” arXiv:2205.14135v2 [cs.LG], Jun 2022 (https://arxiv.org/pdf/2205.14135).
[5] Tri Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” arXiv:2307.08691v1 [cs.LG], Jul 2023 (https://arxiv.org/pdf/2307.08691)
[6] Jay Shah, Ganesh Bikshandi , Ying Zhang , Vijay Thakkar , Pradeep Ramani, and Tri Dao, “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision,” arXiv:2407.08608v2 [cs.LG], Jul 2024 (https://arxiv.org/pdf/2407.08608)
[7] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang, “A Survey on Mixture of Experts,” arXiv preprint arXiv:2407.06204v2 [cs.LG], Aug 2024 (https://arxiv.org/pdf/2407.06204)
[8] Artom Eliseev and Denis Mazur, “Fast Inference of Mixture-of-Experts Language Models with Offloading,” arXiv preprint arXiv:2312.17238v1 [cs.LG], Dec 2023 (https://arxiv.org/pdf/2312.17238)
[9] Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina, “MoE-Infinity: Offloading-Efficient MoE Model Serving”, arXiv preprint arXiv:2401.14361, Jan 2024
[10] Tim Dettmers, Luke Zettlemoyer, “The case for 4-bit precision: k-bit Inference Scaling Laws”, arXiv preprint arXiv:2212.09720 [cs.LG], Dec 2022
[11] https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
[13] https://developer.arm.com/Processors/Cortex-A78AE
[14] https://nanoreview.net/en/soc-compare/apple-a17-pro-vs-apple-a15-bionic
Deploying models locally has many benefits. Edge deployments can utilize existing compute resources which are often idle, meaning that inference does not require expensive and in-demand data-center Graphical Processing Units (GPUs). With local deployments, sensitive and personal data is never present on remote servers, enhancing user privacy and enabling regulatory compliance. Furthermore, processing data locally can significantly enhance the user experience by reducing model latency. Finally, model personalization can be made more dynamic by storing different weights on different devices – this enables models to be tailored directly on the device to suit individual user preferences and needs. In contrast, this can be challenging to achieve on cloud servers which must batch queries to maintain cost-efficiency.
Ultimately, powerful local models, capable of performing meaningful linguistic, multimodal, and even intellectual work, will likely be deployed on a large variety of edge devices which are specialized for relevant tasks or personalized for local users. These edge models can also be integrated with larger cloud models, e.g. by routing challenging queries to cloud models, or by offloading compute to the cloud where required. This synthesis of cloud and local compute will maximize efficiency in terms of cost and power and enable the widespread and ubiquitous deployment of AI systems.
However, deploying language models (LMs) on local devices also presents a unique set of challenges compared to data center deployments. Edge devices have vastly fewer system resources than data centers - increasing the size of models on edge devices is significantly more difficult than in data centers because memory bandwidth/capacity and processing cores are comparatively limited. We aim to address this by optimizing model architecture for parameter and compute efficiency, a process we have begun with the Zamba model series. Zamba models utilize a novel parameter sharing approach to reduce both the resident set size (RSS), and the number of floating-point operations (FLOP) required to generate the next token. The consequence of this design is higher FLOP efficiency, which directly translates to lower latency and higher throughput.
To achieve more general capabilities and utility, models must be embedded in a scaffolded system allowing, for instance, database or internet lookups (where available), or tool calling, to target specialized use-cases. Creating a general scaffolding system for on-device models is complex, given the variety and heterogeneity of environments and requirements.
The third challenge lies in the heterogenous and often complex architectures of edge devices. The figure below shows the typical architecture of an edge SoC (System-on-a-Chip), highlighting specialized accelerators like the Neural Engine found in the iPhone. It also illustrates the available memory access bandwidths.
Such edge SoCs typically comprise multiple hardware resources, some optimized for ML workloads, such as local GPUs or Neural Processing Units (NPUs). Available resources vary depending on the device. Consequently, to maximize hardware utilization, the software platforms for serving edge models need to consider both effective multitasking, and splitting of work between different processors, which adds implementation complexity. Additionally, the runtime software ecosystem is fragmented and immature, presenting a large number of compile targets.
Nevertheless, the core constraint of edge devices is typically the same as data-center GPUs – memory bandwidth. Memory bandwidth is primarily utilized to load model parameters from RAM into the compute units of the device. In data centers, mini-batching is used to circumvent this memory bound. However, edge deployments typically infer one batch at a time (single-batch/batch size = 1 inference). Thus, optimizing model architecture and inference libraries to reduce memory bandwidth requirements for single-batch inference is critical.
To illustrate some constraints on the edge, let’s consider two practical examples. The first concerns Apple’s iPhone 15 Pro, and the second concerns NVIDIA’s Jetson Orin (64GB).
IPhone 15 Pro has the following specifications (estimated from [14] and [15]):
Jetson Orin (estimated from [11][12][13]):
Now, let’s consider deploying the LLaMA3 8B model on this device. The model has the following attributes:
Assume a compute bound regime; let’s calculate the possible tokens per second one could achieve on this device. Just to make things simple we will make the assumption that the number of flops required for LLaMA3 8B is approximately 16 GFLOPS. We assume this because most operations are done in the MLP layers of transformers, which require 1 multiply and 1 add per parameter, and that batch size is equal to 1.
Let's now estimate the number of tokens that can be processed per second under a memory-bound scenario, where memory bandwidth is the limiting factor. Given a memory bandwidth of 51.2GBps for the iPhone and 204.8GBps for Jetson’s Orin, a 4-bit quantization scheme, and the requirement that model weights must be read at least once per token, we can calculate the number of tokens per second as follows:
Beyond loading the model weights, there is additional overhead from reading and writing activations, which increases the total data that must be transferred for each token generated. Assuming the activations per layer are substantial—potentially hundreds of MBs per layer, depending on the architecture—we estimate that the model size expands from 4GB to 6GB due to these activations. Consequently:
It's evident that, regardless of the compute core selected, memory bandwidth remains the most critical limiting factor on the edge. This theoretical value closely matches actual performance measurements, with variations mainly due to the precise size of activations and context length. To address this, Zyphra is prioritizing techniques that reduce bandwidth utilization, such as quantization, the use of LoRA adapters [1], approximate matrix multiplication methods such as randomized linear algebra, and SVD decomposition. Additionally, focusing on SSM models, which have a very compact memory activation footprint, further minimizes bandwidth used for activations [2, 3].
This indicates that to avoid bandwidth limitations, the selected alpha must satisfy this specific condition. Referring back to the iPhone example, where the DDR to SSD bandwidth ratio is about 50, alpha would need to be 0.0196. This means that only 2% of the model can be loaded from storage while still maximizing DRAM bandwidth, which is not very practical. Considering the previous equation, but this time treating the DDR bandwidth as the unknown variable. This would yield the following result:
With an equal distribution between SSD and DDR (i.e. half the parameters offloaded; alpha = 0.5), the DDR bandwidth utilized would need to match that of the SSD. This would result in a significant reduction in the effective DDR bandwidth used. This necessarily results in significant slowdowns: as per the previous example, DDR memory bandwidth (not FLOPs) is the primary bottleneck on performance. Specifically, in the case of the iPhone, where the SSD bandwidth is 1GBps and the maximum DDR bandwidth is 50GBps, this configuration would reduce DDR bandwidth utilization by a factor of 50 (50/1). This reduction would lead to a 50x slowdown in tokens per second, which is entirely unacceptable.
It’s evident that maximizing SSD bandwidth is challenging, yet the vast storage capacity it offers is incredibly valuable. This is why we’re focused on advancing research in this area. At Zyphra, our team brings together expertise in chip design and model development, actively pursuing applied research to tackle this intricate and compelling issue at the edge.
Optimizing models for the edge requires technical advances in many areas, which we are pursuing at Zyphra. To improve the performance per memory bandwidth of a model, we are exploring a number of techniques: advanced quantization, matrix compression through low-rank approximations, parameter-sharing techniques, and designing and exploiting unstructured sparsity in model parameters and activations. Another approach is to exploit structured sparsity in the form of sparsely activated models such as Mixture of Experts [7] which perform extremely strongly per bit of memory bandwidth required in inference. To address the high memory footprint of such models, they must be carefully designed such that parameters can be offloaded to cheaper, more abundant storage, such as SSD or disk.
The software ecosystem for deployment of models on edge devices is also fragmented. There are tools that compile models for runtime execution, such as IREE and ExecuTorch, which can optimize model execution for different environments, leveraging hardware-specific optimizations to improve performance. There are also different tensor backends, whose operations are optimized for different runtimes. Hardware-specificity typically increases training and inference throughput [4, 5, 6]. We aim to collaborate with and build upon these, and other projects to create general libraries for optimizing model deployment and serving across a wide range of hardware platforms.
To make local models ubiquitous, one of our goals at Zyphra is to build a general serving framework and architecture to run local models and their associated scaffolding efficiently across a wide range of devices. This architecture will require both a highly optimized and efficient inference runtime, as well as being capable of inferencing existing model architectures and exploiting sparsity efficiently. Such a system must also support task scheduling to integrate the hardware state with potential user requests to optimize the response given the constraints. For instance, this system could assess availability of access to internet APIs or cloud models for task offloading, detect when there is space available for on-device finetuning and personalization, and decide when to run background batch tasks vs handling user requests. Additionally, such a system must efficiently dispatch and route work to a variety of hardware cores on an SoC with differing capabilities and strengths. This system will have several components:
Local devices running inference at low batch sizes to handle specific use-cases or individual users enables a degree of model personalization, such as full finetuning, that would be highly inefficient to replicate in the cloud. This provides one of the key advantages of local model deployments – the ability to tailor your model to your own needs. To this end, we are exploring a number of approaches that allow the on-device customization of model behavior and parameters.
Overall, we believe that local, private, and personalized models offer compelling advantages in many domains and are required to achieve ubiquitous, personal AI. Zyphra aims to achieve highly performant models for local devices through careful co-design of model architecture with hardware constraints to maximize hardware utilization, memory-bandwidth and flop efficiency. Moreover, Zyphra aims to enable the online and on-device personalization of AI models for specific use-cases and specific users which we believe will unlock significant utility.
[1] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, “QLORA: Efficient Finetuning of Quantized LLMs,” arXiv preprint arXiv:2305.14314v1 [cs.LG], May 2023 (https://arxiv.org/pdf/2305.14314)
[2] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge, “Zamba: A Compact 7B SSM Hybrid Model,” arXiv preprint arXiv:2405.16712v1 [cs.LG], May 2024 (https://arxiv.org/pdf/2405.16712)
[3] Opher Lieber, Barak Lenz, Hofit Bata, et al, “Jamba: A Hybrid Transformer-Mamba Language Model,” arXiv preprint arXiv:2403.19887v2 [cs.CL], Jul 2024 (https://arxiv.org/pdf/2403.19887)
[4] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” arXiv:2205.14135v2 [cs.LG], Jun 2022 (https://arxiv.org/pdf/2205.14135).
[5] Tri Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” arXiv:2307.08691v1 [cs.LG], Jul 2023 (https://arxiv.org/pdf/2307.08691)
[6] Jay Shah, Ganesh Bikshandi , Ying Zhang , Vijay Thakkar , Pradeep Ramani, and Tri Dao, “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision,” arXiv:2407.08608v2 [cs.LG], Jul 2024 (https://arxiv.org/pdf/2407.08608)
[7] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang, “A Survey on Mixture of Experts,” arXiv preprint arXiv:2407.06204v2 [cs.LG], Aug 2024 (https://arxiv.org/pdf/2407.06204)
[8] Artom Eliseev and Denis Mazur, “Fast Inference of Mixture-of-Experts Language Models with Offloading,” arXiv preprint arXiv:2312.17238v1 [cs.LG], Dec 2023 (https://arxiv.org/pdf/2312.17238)
[9] Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina, “MoE-Infinity: Offloading-Efficient MoE Model Serving”, arXiv preprint arXiv:2401.14361, Jan 2024
[10] Tim Dettmers, Luke Zettlemoyer, “The case for 4-bit precision: k-bit Inference Scaling Laws”, arXiv preprint arXiv:2212.09720 [cs.LG], Dec 2022
[11] https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
[13] https://developer.arm.com/Processors/Cortex-A78AE
[14] https://nanoreview.net/en/soc-compare/apple-a17-pro-vs-apple-a15-bionic
Deploying models locally has many benefits. Edge deployments can utilize existing compute resources which are often idle, meaning that inference does not require expensive and in-demand data-center Graphical Processing Units (GPUs). With local deployments, sensitive and personal data is never present on remote servers, enhancing user privacy and enabling regulatory compliance. Furthermore, processing data locally can significantly enhance the user experience by reducing model latency. Finally, model personalization can be made more dynamic by storing different weights on different devices – this enables models to be tailored directly on the device to suit individual user preferences and needs. In contrast, this can be challenging to achieve on cloud servers which must batch queries to maintain cost-efficiency.
Ultimately, powerful local models, capable of performing meaningful linguistic, multimodal, and even intellectual work, will likely be deployed on a large variety of edge devices which are specialized for relevant tasks or personalized for local users. These edge models can also be integrated with larger cloud models, e.g. by routing challenging queries to cloud models, or by offloading compute to the cloud where required. This synthesis of cloud and local compute will maximize efficiency in terms of cost and power and enable the widespread and ubiquitous deployment of AI systems.
However, deploying language models (LMs) on local devices also presents a unique set of challenges compared to data center deployments. Edge devices have vastly fewer system resources than data centers - increasing the size of models on edge devices is significantly more difficult than in data centers because memory bandwidth/capacity and processing cores are comparatively limited. We aim to address this by optimizing model architecture for parameter and compute efficiency, a process we have begun with the Zamba model series. Zamba models utilize a novel parameter sharing approach to reduce both the resident set size (RSS), and the number of floating-point operations (FLOP) required to generate the next token. The consequence of this design is higher FLOP efficiency, which directly translates to lower latency and higher throughput.
To achieve more general capabilities and utility, models must be embedded in a scaffolded system allowing, for instance, database or internet lookups (where available), or tool calling, to target specialized use-cases. Creating a general scaffolding system for on-device models is complex, given the variety and heterogeneity of environments and requirements.
The third challenge lies in the heterogenous and often complex architectures of edge devices. The figure below shows the typical architecture of an edge SoC (System-on-a-Chip), highlighting specialized accelerators like the Neural Engine found in the iPhone. It also illustrates the available memory access bandwidths.
Such edge SoCs typically comprise multiple hardware resources, some optimized for ML workloads, such as local GPUs or Neural Processing Units (NPUs). Available resources vary depending on the device. Consequently, to maximize hardware utilization, the software platforms for serving edge models need to consider both effective multitasking, and splitting of work between different processors, which adds implementation complexity. Additionally, the runtime software ecosystem is fragmented and immature, presenting a large number of compile targets.
Nevertheless, the core constraint of edge devices is typically the same as data-center GPUs – memory bandwidth. Memory bandwidth is primarily utilized to load model parameters from RAM into the compute units of the device. In data centers, mini-batching is used to circumvent this memory bound. However, edge deployments typically infer one batch at a time (single-batch/batch size = 1 inference). Thus, optimizing model architecture and inference libraries to reduce memory bandwidth requirements for single-batch inference is critical.
To illustrate some constraints on the edge, let’s consider two practical examples. The first concerns Apple’s iPhone 15 Pro, and the second concerns NVIDIA’s Jetson Orin (64GB).
IPhone 15 Pro has the following specifications (estimated from [14] and [15]):
Jetson Orin (estimated from [11][12][13]):
Now, let’s consider deploying the LLaMA3 8B model on this device. The model has the following attributes:
Assume a compute bound regime; let’s calculate the possible tokens per second one could achieve on this device. Just to make things simple we will make the assumption that the number of flops required for LLaMA3 8B is approximately 16 GFLOPS. We assume this because most operations are done in the MLP layers of transformers, which require 1 multiply and 1 add per parameter, and that batch size is equal to 1.
Let's now estimate the number of tokens that can be processed per second under a memory-bound scenario, where memory bandwidth is the limiting factor. Given a memory bandwidth of 51.2GBps for the iPhone and 204.8GBps for Jetson’s Orin, a 4-bit quantization scheme, and the requirement that model weights must be read at least once per token, we can calculate the number of tokens per second as follows:
Beyond loading the model weights, there is additional overhead from reading and writing activations, which increases the total data that must be transferred for each token generated. Assuming the activations per layer are substantial—potentially hundreds of MBs per layer, depending on the architecture—we estimate that the model size expands from 4GB to 6GB due to these activations. Consequently:
It's evident that, regardless of the compute core selected, memory bandwidth remains the most critical limiting factor on the edge. This theoretical value closely matches actual performance measurements, with variations mainly due to the precise size of activations and context length. To address this, Zyphra is prioritizing techniques that reduce bandwidth utilization, such as quantization, the use of LoRA adapters [1], approximate matrix multiplication methods such as randomized linear algebra, and SVD decomposition. Additionally, focusing on SSM models, which have a very compact memory activation footprint, further minimizes bandwidth used for activations [2, 3].
Deploying models locally has many benefits. Edge deployments can utilize existing compute resources which are often idle, meaning that inference does not require expensive and in-demand data-center Graphical Processing Units (GPUs). With local deployments, sensitive and personal data is never present on remote servers, enhancing user privacy and enabling regulatory compliance. Furthermore, processing data locally can significantly enhance the user experience by reducing model latency. Finally, model personalization can be made more dynamic by storing different weights on different devices – this enables models to be tailored directly on the device to suit individual user preferences and needs. In contrast, this can be challenging to achieve on cloud servers which must batch queries to maintain cost-efficiency.
Ultimately, powerful local models, capable of performing meaningful linguistic, multimodal, and even intellectual work, will likely be deployed on a large variety of edge devices which are specialized for relevant tasks or personalized for local users. These edge models can also be integrated with larger cloud models, e.g. by routing challenging queries to cloud models, or by offloading compute to the cloud where required. This synthesis of cloud and local compute will maximize efficiency in terms of cost and power and enable the widespread and ubiquitous deployment of AI systems.
However, deploying language models (LMs) on local devices also presents a unique set of challenges compared to data center deployments. Edge devices have vastly fewer system resources than data centers - increasing the size of models on edge devices is significantly more difficult than in data centers because memory bandwidth/capacity and processing cores are comparatively limited. We aim to address this by optimizing model architecture for parameter and compute efficiency, a process we have begun with the Zamba model series. Zamba models utilize a novel parameter sharing approach to reduce both the resident set size (RSS), and the number of floating-point operations (FLOP) required to generate the next token. The consequence of this design is higher FLOP efficiency, which directly translates to lower latency and higher throughput.
To achieve more general capabilities and utility, models must be embedded in a scaffolded system allowing, for instance, database or internet lookups (where available), or tool calling, to target specialized use-cases. Creating a general scaffolding system for on-device models is complex, given the variety and heterogeneity of environments and requirements.
The third challenge lies in the heterogenous and often complex architectures of edge devices. The figure below shows the typical architecture of an edge SoC (System-on-a-Chip), highlighting specialized accelerators like the Neural Engine found in the iPhone. It also illustrates the available memory access bandwidths.
Such edge SoCs typically comprise multiple hardware resources, some optimized for ML workloads, such as local GPUs or Neural Processing Units (NPUs). Available resources vary depending on the device. Consequently, to maximize hardware utilization, the software platforms for serving edge models need to consider both effective multitasking, and splitting of work between different processors, which adds implementation complexity. Additionally, the runtime software ecosystem is fragmented and immature, presenting a large number of compile targets.
Nevertheless, the core constraint of edge devices is typically the same as data-center GPUs – memory bandwidth. Memory bandwidth is primarily utilized to load model parameters from RAM into the compute units of the device. In data centers, mini-batching is used to circumvent this memory bound. However, edge deployments typically infer one batch at a time (single-batch/batch size = 1 inference). Thus, optimizing model architecture and inference libraries to reduce memory bandwidth requirements for single-batch inference is critical.
To illustrate some constraints on the edge, let’s consider two practical examples. The first concerns Apple’s iPhone 15 Pro, and the second concerns NVIDIA’s Jetson Orin (64GB).
IPhone 15 Pro has the following specifications (estimated from [14] and [15]):
Jetson Orin (estimated from [11][12][13]):
Now, let’s consider deploying the LLaMA3 8B model on this device. The model has the following attributes:
Assume a compute bound regime; let’s calculate the possible tokens per second one could achieve on this device. Just to make things simple we will make the assumption that the number of flops required for LLaMA3 8B is approximately 16 GFLOPS. We assume this because most operations are done in the MLP layers of transformers, which require 1 multiply and 1 add per parameter, and that batch size is equal to 1.
Let's now estimate the number of tokens that can be processed per second under a memory-bound scenario, where memory bandwidth is the limiting factor. Given a memory bandwidth of 51.2GBps for the iPhone and 204.8GBps for Jetson’s Orin, a 4-bit quantization scheme, and the requirement that model weights must be read at least once per token, we can calculate the number of tokens per second as follows:
Beyond loading the model weights, there is additional overhead from reading and writing activations, which increases the total data that must be transferred for each token generated. Assuming the activation traffic is substantial (potentially tens of MB per layer per token, depending on the architecture and context length), we estimate that the effective data moved per token grows from 4 GB to roughly 6 GB. Consequently:
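With roughly 6 GB of traffic per token, the same bandwidths give:

$$\frac{51.2}{6} \approx 8.5\ \text{tokens/s (iPhone 15 Pro)}, \qquad \frac{204.8}{6} \approx 34\ \text{tokens/s (Jetson Orin)}$$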
It's evident that, regardless of the compute core selected, memory bandwidth remains the most critical limiting factor on the edge. These theoretical values closely match actual performance measurements, with variations mainly due to the precise size of the activations and the context length. To address this, Zyphra is prioritizing techniques that reduce bandwidth utilization, such as quantization, LoRA adapters [1], and approximate matrix multiplication methods such as randomized linear algebra and SVD-based low-rank factorization. Additionally, focusing on SSM models, which have a very compact activation memory footprint, further minimizes the bandwidth spent on activations [2, 3].
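Another lever is to offload a fraction alpha of the model weights to SSD storage, streaming them in for each token while the remaining fraction is read from DRAM. As a rough sketch, assuming the SSD and DRAM transfers overlap, keeping the SSD from becoming the bottleneck requires:

$$\frac{\alpha M}{B_{\text{SSD}}} \;\le\; \frac{(1-\alpha) M}{B_{\text{DDR}}} \quad\Longrightarrow\quad \alpha \;\le\; \frac{1}{1 + B_{\text{DDR}}/B_{\text{SSD}}}$$

where $M$ is the quantized model size and $B_{\text{SSD}}$, $B_{\text{DDR}}$ are the respective bandwidths.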
This indicates that, to avoid bandwidth limitations, the chosen alpha must satisfy this specific condition. Referring back to the iPhone example, where the DDR-to-SSD bandwidth ratio is about 50, alpha would need to be at most 1/51 ≈ 0.0196. In other words, only about 2% of the model can be served from storage without degrading effective DRAM bandwidth, which is not very practical. Now consider the same relation, but this time treating the DDR bandwidth as the unknown variable. This yields the following result:
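Under the same overlapping-transfer assumption, solving for the DRAM bandwidth that can effectively be exploited when the SSD path sets the pace gives:

$$B_{\text{DDR}}^{\text{eff}} \;=\; B_{\text{SSD}} \cdot \frac{1-\alpha}{\alpha}$$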
With an equal split between SSD and DDR (i.e. half the parameters offloaded; alpha = 0.5), the DDR bandwidth that can effectively be used drops to match that of the SSD. This necessarily causes a severe slowdown: as in the previous example, DDR memory bandwidth (not FLOPs) is the primary bottleneck on performance. Specifically, for the iPhone, where the SSD bandwidth is 1 GB/s and the maximum DDR bandwidth is roughly 50 GB/s, this configuration reduces effective DDR bandwidth utilization by a factor of about 50. In throughput terms this amounts to roughly a 25-50x slowdown in tokens per second (depending on how fully the SSD and DRAM reads overlap), which is entirely unacceptable.
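As a small illustrative sketch (the function and variable names are ours, and the overlapping-read assumption above is retained), the following script reproduces these estimates:

```python
# Illustrative throughput model for streaming a fraction of the weights from SSD.
# Assumptions (for illustration only): weights are read exactly once per token,
# SSD and DRAM transfers overlap fully, and activation traffic is ignored.

def tokens_per_second(alpha: float, model_gb: float, bw_ddr: float, bw_ssd: float) -> float:
    """Estimated tokens/s when a fraction `alpha` of the weights is streamed from SSD."""
    t_ssd = alpha * model_gb / bw_ssd          # seconds to stream the SSD-resident share
    t_ddr = (1 - alpha) * model_gb / bw_ddr    # seconds to read the DRAM-resident share
    return 1.0 / max(t_ssd, t_ddr)             # the slower path sets the per-token time

MODEL_GB = 4.0                 # LLaMA3 8B at 4-bit quantization
BW_DDR, BW_SSD = 50.0, 1.0     # iPhone-class figures used above (GB/s)

baseline = tokens_per_second(0.0, MODEL_GB, BW_DDR, BW_SSD)
for alpha in (0.0196, 0.5):
    tps = tokens_per_second(alpha, MODEL_GB, BW_DDR, BW_SSD)
    print(f"alpha={alpha}: {tps:6.2f} tok/s  (DRAM-only baseline: {baseline:.1f} tok/s)")
```

Under this simple model the alpha = 0.5 configuration lands at roughly half a token per second on iPhone-class hardware, versus about 12.5 tokens per second when everything is served from DRAM.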
It’s evident that maximizing SSD bandwidth is challenging, yet the vast storage capacity it offers is incredibly valuable. This is why we’re focused on advancing research in this area. At Zyphra, our team brings together expertise in chip design and model development, actively pursuing applied research to tackle this intricate and compelling issue at the edge.
Optimizing models for the edge requires technical advances in many areas, which we are pursuing at Zyphra. To improve the performance per unit of memory bandwidth of a model, we are exploring a number of techniques: advanced quantization, matrix compression through low-rank approximations, parameter-sharing techniques, and designing and exploiting unstructured sparsity in model parameters and activations. Another approach is to exploit structured sparsity in the form of sparsely activated models, such as Mixture-of-Experts models [7], which perform extremely well per byte of memory bandwidth required during inference. To address the high memory footprint of such models, they must be carefully designed so that parameters can be offloaded to cheaper, more abundant storage, such as SSD or disk [8, 9].
The software ecosystem for deploying models on edge devices is also fragmented. There are tools that compile models for runtime execution, such as IREE and ExecuTorch, which can optimize model execution for different environments, leveraging hardware-specific optimizations to improve performance. There are also different tensor backends whose operations are optimized for different runtimes. Hardware specificity typically increases training and inference throughput [4, 5, 6]. We aim to collaborate with and build upon these and other projects to create general libraries for optimizing model deployment and serving across a wide range of hardware platforms.
To make local models ubiquitous, one of our goals at Zyphra is to build a general serving framework and architecture to run local models and their associated scaffolding efficiently across a wide range of devices. This architecture will require a highly optimized and efficient inference runtime that can serve existing model architectures and exploit sparsity efficiently. Such a system must also support task scheduling that integrates the hardware state with pending user requests to optimize the response under the device's constraints. For instance, this system could assess the availability of internet APIs or cloud models for task offloading, detect when there is headroom for on-device finetuning and personalization, and decide when to run background batch tasks versus handling user requests. Additionally, such a system must efficiently dispatch and route work to the variety of hardware cores on an SoC, each with differing capabilities and strengths. This system will have several components; the task-scheduling component, for instance, might look something like the sketch below.
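As a purely illustrative sketch (the component names, state fields, and routing rules here are hypothetical, not a description of our actual design):

```python
from dataclasses import dataclass

@dataclass
class DeviceState:
    """Snapshot of the hardware and user context the scheduler reasons about (illustrative)."""
    has_network: bool    # can we reach cloud models / internet APIs?
    battery_ok: bool     # enough power headroom for background work?
    npu_idle: bool       # is the accelerator free right now?
    user_waiting: bool   # is there an interactive request pending?

def schedule(task: str, state: DeviceState) -> str:
    """Route a task to local inference, the cloud, or a deferred background queue."""
    if state.user_waiting and task == "interactive_query":
        return "run_local_now"                  # latency matters most: serve on-device
    if task == "hard_query" and state.has_network:
        return "offload_to_cloud"               # route challenging queries to a larger model
    if task == "personalization_finetune":
        # only personalize when the device is otherwise idle and powered
        return "run_background" if (state.npu_idle and state.battery_ok) else "defer"
    return "queue_batch"                        # everything else waits for a batch window

print(schedule("interactive_query", DeviceState(True, True, False, True)))
```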
Running inference locally at low batch sizes for specific use-cases or individual users enables a degree of model personalization, such as full finetuning, that would be highly inefficient to replicate in the cloud. This is one of the key advantages of local model deployments: the ability to tailor your model to your own needs. To this end, we are exploring a number of approaches that allow the on-device customization of model behavior and parameters.
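One such approach is low-rank adapters of the kind used in [1]. A minimal PyTorch sketch of the idea (not our production implementation) shows why it suits on-device personalization: only the two small adapter matrices are trained and stored per user, while the pretrained weights stay frozen.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter around a frozen linear layer (a sketch, not a full implementation)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the pretrained weights stay frozen on-device
            p.requires_grad = False
        # Only these two small matrices are trained and stored per user.
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scaling

# Personalizing a 4096x4096 layer touches only rank * (in + out) extra parameters,
# so the adapter can be finetuned and swapped on-device cheaply.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 65,536 trainable params
```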
Overall, we believe that local, private, and personalized models offer compelling advantages in many domains and are required to achieve ubiquitous, personal AI. Zyphra aims to deliver highly performant models for local devices through careful co-design of the model architecture with hardware constraints to maximize hardware utilization, memory-bandwidth efficiency, and FLOP efficiency. Moreover, Zyphra aims to enable the online, on-device personalization of AI models for specific use-cases and specific users, which we believe will unlock significant utility.
[1] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, “QLORA: Efficient Finetuning of Quantized LLMs,” arXiv preprint arXiv:2305.14314v1 [cs.LG], May 2023 (https://arxiv.org/pdf/2305.14314)
[2] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge, “Zamba: A Compact 7B SSM Hybrid Model,” arXiv preprint arXiv:2405.16712v1 [cs.LG], May 2024 (https://arxiv.org/pdf/2405.16712)
[3] Opher Lieber, Barak Lenz, Hofit Bata, et al, “Jamba: A Hybrid Transformer-Mamba Language Model,” arXiv preprint arXiv:2403.19887v2 [cs.CL], Jul 2024 (https://arxiv.org/pdf/2403.19887)
[4] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” arXiv preprint arXiv:2205.14135v2 [cs.LG], Jun 2022 (https://arxiv.org/pdf/2205.14135)
[5] Tri Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” arXiv preprint arXiv:2307.08691v1 [cs.LG], Jul 2023 (https://arxiv.org/pdf/2307.08691)
[6] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao, “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision,” arXiv preprint arXiv:2407.08608v2 [cs.LG], Jul 2024 (https://arxiv.org/pdf/2407.08608)
[7] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang, “A Survey on Mixture of Experts,” arXiv preprint arXiv:2407.06204v2 [cs.LG], Aug 2024 (https://arxiv.org/pdf/2407.06204)
[8] Artyom Eliseev and Denis Mazur, “Fast Inference of Mixture-of-Experts Language Models with Offloading,” arXiv preprint arXiv:2312.17238v1 [cs.LG], Dec 2023 (https://arxiv.org/pdf/2312.17238)
[9] Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina, “MoE-Infinity: Offloading-Efficient MoE Model Serving”, arXiv preprint arXiv:2401.14361, Jan 2024
[10] Tim Dettmers, Luke Zettlemoyer, “The case for 4-bit precision: k-bit Inference Scaling Laws”, arXiv preprint arXiv:2212.09720 [cs.LG], Dec 2022
[11] https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
[13] https://developer.arm.com/Processors/Cortex-A78AE
[14] https://nanoreview.net/en/soc-compare/apple-a17-pro-vs-apple-a15-bionic
We present histograms depicting the distribution of cluster sizes in all the datasets (see Fig. 7-11); note that all the figures are on a log-log scale. We see a significant drop in the number of clusters starting at a cluster size of around 100. This drop is present in both DCLM and FineWeb-Edu2 (see Fig. 8 and 9, respectively), and is most likely explained by a combination of the deduplication strategy and quality filtering used when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low-quality material (repeated advertisements, license-agreement templates, etc.), so it is not surprising that such documents were removed. Notably, DCLM still contained one cluster of close to 1 million documents, consisting of low-quality documents seemingly coming from advertisements (see Appendix). We find that both Zyda-1 and Dolma-CC contain a small number of duplicates, which is expected, since both datasets were deduplicated globally by their authors. The remaining duplicates are likely false negatives from the initial deduplication procedure. Note that the distributions of duplicate-cluster sizes for these two datasets (Fig. 10 and 11) do not contain any sharp drops, but rather decrease hyper-exponentially with cluster size.
Below is an example document from the largest cluster of duplicates in DCLM (~1M documents; quality score 0.482627):
Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is SafeSafe score: 1
The higher the number, the more dangerous the website.Any number higher than 1 means DANGER.
Positive votes:
Negative votes:
Vote Up Vote Down review
Have you had bad experience with Warn us, please!
Below are a few documents with different quality scores from DCLM, all coming from the same duplicate cluster. The quality score varies from ~0.2 to ~0.04.
Deploying models locally has many benefits. Edge deployments can utilize existing compute resources which are often idle, meaning that inference does not require expensive and in-demand data-center Graphical Processing Units (GPUs). With local deployments, sensitive and personal data is never present on remote servers, enhancing user privacy and enabling regulatory compliance. Furthermore, processing data locally can significantly enhance the user experience by reducing model latency. Finally, model personalization can be made more dynamic by storing different weights on different devices – this enables models to be tailored directly on the device to suit individual user preferences and needs. In contrast, this can be challenging to achieve on cloud servers which must batch queries to maintain cost-efficiency.
Ultimately, powerful local models, capable of performing meaningful linguistic, multimodal, and even intellectual work, will likely be deployed on a large variety of edge devices which are specialized for relevant tasks or personalized for local users. These edge models can also be integrated with larger cloud models, e.g. by routing challenging queries to cloud models, or by offloading compute to the cloud where required. This synthesis of cloud and local compute will maximize efficiency in terms of cost and power and enable the widespread and ubiquitous deployment of AI systems.
However, deploying language models (LMs) on local devices also presents a unique set of challenges compared to data center deployments. Edge devices have vastly fewer system resources than data centers - increasing the size of models on edge devices is significantly more difficult than in data centers because memory bandwidth/capacity and processing cores are comparatively limited. We aim to address this by optimizing model architecture for parameter and compute efficiency, a process we have begun with the Zamba model series. Zamba models utilize a novel parameter sharing approach to reduce both the resident set size (RSS), and the number of floating-point operations (FLOP) required to generate the next token. The consequence of this design is higher FLOP efficiency, which directly translates to lower latency and higher throughput.
To achieve more general capabilities and utility, models must be embedded in a scaffolded system allowing, for instance, database or internet lookups (where available), or tool calling, to target specialized use-cases. Creating a general scaffolding system for on-device models is complex, given the variety and heterogeneity of environments and requirements.
The third challenge lies in the heterogenous and often complex architectures of edge devices. The figure below shows the typical architecture of an edge SoC (System-on-a-Chip), highlighting specialized accelerators like the Neural Engine found in the iPhone. It also illustrates the available memory access bandwidths.
Such edge SoCs typically comprise multiple hardware resources, some optimized for ML workloads, such as local GPUs or Neural Processing Units (NPUs). Available resources vary depending on the device. Consequently, to maximize hardware utilization, the software platforms for serving edge models need to consider both effective multitasking, and splitting of work between different processors, which adds implementation complexity. Additionally, the runtime software ecosystem is fragmented and immature, presenting a large number of compile targets.
Nevertheless, the core constraint of edge devices is typically the same as data-center GPUs – memory bandwidth. Memory bandwidth is primarily utilized to load model parameters from RAM into the compute units of the device. In data centers, mini-batching is used to circumvent this memory bound. However, edge deployments typically infer one batch at a time (single-batch/batch size = 1 inference). Thus, optimizing model architecture and inference libraries to reduce memory bandwidth requirements for single-batch inference is critical.
To illustrate some constraints on the edge, let’s consider two practical examples. The first concerns Apple’s iPhone 15 Pro, and the second concerns NVIDIA’s Jetson Orin (64GB).
IPhone 15 Pro has the following specifications (estimated from [14] and [15]):
Jetson Orin (estimated from [11][12][13]):
Now, let’s consider deploying the LLaMA3 8B model on this device. The model has the following attributes:
Assume a compute bound regime; let’s calculate the possible tokens per second one could achieve on this device. Just to make things simple we will make the assumption that the number of flops required for LLaMA3 8B is approximately 16 GFLOPS. We assume this because most operations are done in the MLP layers of transformers, which require 1 multiply and 1 add per parameter, and that batch size is equal to 1.
Let's now estimate the number of tokens that can be processed per second under a memory-bound scenario, where memory bandwidth is the limiting factor. Given a memory bandwidth of 51.2GBps for the iPhone and 204.8GBps for Jetson’s Orin, a 4-bit quantization scheme, and the requirement that model weights must be read at least once per token, we can calculate the number of tokens per second as follows:
Beyond loading the model weights, there is additional overhead from reading and writing activations, which increases the total data that must be transferred for each token generated. Assuming the activations per layer are substantial—potentially hundreds of MBs per layer, depending on the architecture—we estimate that the model size expands from 4GB to 6GB due to these activations. Consequently:
It's evident that, regardless of the compute core selected, memory bandwidth remains the most critical limiting factor on the edge. This theoretical value closely matches actual performance measurements, with variations mainly due to the precise size of activations and context length. To address this, Zyphra is prioritizing techniques that reduce bandwidth utilization, such as quantization, the use of LoRA adapters [1], approximate matrix multiplication methods such as randomized linear algebra, and SVD decomposition. Additionally, focusing on SSM models, which have a very compact memory activation footprint, further minimizes bandwidth used for activations [2, 3].
This indicates that to avoid bandwidth limitations, the selected alpha must satisfy this specific condition. Referring back to the iPhone example, where the DDR to SSD bandwidth ratio is about 50, alpha would need to be 0.0196. This means that only 2% of the model can be loaded from storage while still maximizing DRAM bandwidth, which is not very practical. Considering the previous equation, but this time treating the DDR bandwidth as the unknown variable. This would yield the following result:
With an equal distribution between SSD and DDR (i.e. half the parameters offloaded; alpha = 0.5), the DDR bandwidth utilized would need to match that of the SSD. This would result in a significant reduction in the effective DDR bandwidth used. This necessarily results in significant slowdowns: as per the previous example, DDR memory bandwidth (not FLOPs) is the primary bottleneck on performance. Specifically, in the case of the iPhone, where the SSD bandwidth is 1GBps and the maximum DDR bandwidth is 50GBps, this configuration would reduce DDR bandwidth utilization by a factor of 50 (50/1). This reduction would lead to a 50x slowdown in tokens per second, which is entirely unacceptable.
It’s evident that maximizing SSD bandwidth is challenging, yet the vast storage capacity it offers is incredibly valuable. This is why we’re focused on advancing research in this area. At Zyphra, our team brings together expertise in chip design and model development, actively pursuing applied research to tackle this intricate and compelling issue at the edge.
Optimizing models for the edge requires technical advances in many areas, which we are pursuing at Zyphra. To improve the performance per memory bandwidth of a model, we are exploring a number of techniques: advanced quantization, matrix compression through low-rank approximations, parameter-sharing techniques, and designing and exploiting unstructured sparsity in model parameters and activations. Another approach is to exploit structured sparsity in the form of sparsely activated models such as Mixture of Experts [7] which perform extremely strongly per bit of memory bandwidth required in inference. To address the high memory footprint of such models, they must be carefully designed such that parameters can be offloaded to cheaper, more abundant storage, such as SSD or disk.
The software ecosystem for deploying models on edge devices is also fragmented. There are tools that compile models for runtime execution, such as IREE and ExecuTorch, which can optimize model execution for different environments, leveraging hardware-specific optimizations to improve performance. There are also different tensor backends whose operations are optimized for different runtimes; such hardware specificity typically increases training and inference throughput [4, 5, 6]. We aim to collaborate with and build upon these and other projects to create general libraries for optimizing model deployment and serving across a wide range of hardware platforms.
To make local models ubiquitous, one of our goals at Zyphra is to build a general serving framework and architecture to run local models and their associated scaffolding efficiently across a wide range of devices. This architecture will require a highly optimized and efficient inference runtime, as well as the ability to run existing model architectures and exploit sparsity efficiently. Such a system must also support task scheduling that weighs the current hardware state against pending user requests to deliver the best response within the device's constraints. For instance, this system could assess the availability of internet APIs or cloud models for task offloading, detect when there is headroom for on-device finetuning and personalization, and decide when to run background batch tasks versus handling user requests. Additionally, it must efficiently dispatch and route work to the variety of hardware cores on an SoC, each with differing capabilities and strengths. Such a system will therefore comprise several components, including the inference runtime, a task scheduler, and a hardware dispatch layer; a toy sketch of the scheduling decision follows.
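The sketch below is illustrative only; the task and device fields, thresholds, and routing targets are hypothetical placeholders for the kinds of signals such a scheduler would consume.

```python
from dataclasses import dataclass

@dataclass
class Task:
    est_dram_gb: float        # rough DRAM needed to run the task locally
    latency_sensitive: bool   # e.g. an interactive user query
    offloadable: bool         # allowed to leave the device (privacy policy)

@dataclass
class DeviceState:
    battery_pct: float
    free_dram_gb: float
    network_available: bool
    user_request_pending: bool

def schedule(task: Task, state: DeviceState) -> str:
    """Toy policy: decide where and when a piece of work should run."""
    # Interactive user work pre-empts background jobs such as finetuning.
    if state.user_request_pending and not task.latency_sensitive:
        return "defer"
    # Route to a cloud model when the task cannot fit locally and policy allows it.
    if task.est_dram_gb > state.free_dram_gb and state.network_available and task.offloadable:
        return "offload_to_cloud"
    # Only spend battery on background batch work when there is headroom.
    if not task.latency_sensitive and state.battery_pct < 30:
        return "defer"
    return "run_locally"
```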
Running inference at low batch sizes on local devices, to serve specific use-cases or individual users, enables a degree of model personalization (such as full finetuning) that would be highly inefficient to replicate in the cloud. This is one of the key advantages of local model deployments: the ability to tailor your model to your own needs. To this end, we are exploring a number of approaches that allow the on-device customization of model behavior and parameters.
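One such approach is attaching small low-rank adapters in the spirit of (Q)LoRA [1], so that only a few megabytes of per-user parameters are trained and stored on the device while the (possibly quantized) base weights stay frozen. A minimal sketch of the idea, not our production implementation:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight plus a small trainable low-rank update: y = (W + scale * B @ A) x."""

    def __init__(self, weight: np.ndarray, rank: int = 8, alpha: float = 16.0):
        out_f, in_f = weight.shape
        self.weight = weight                          # frozen base weight
        self.a = np.random.randn(rank, in_f) * 0.01   # trainable, tiny
        self.b = np.zeros((out_f, rank))              # trainable, tiny (zero init => no-op at start)
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        return self.weight @ x + self.scale * (self.b @ (self.a @ x))
```

Because only the adapter matrices change, per-user state stays small enough to store, swap, or sync across devices without ever touching the base model.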
Overall, we believe that local, private, and personalized models offer compelling advantages in many domains and are required to achieve ubiquitous, personal AI. Zyphra aims to deliver highly performant models for local devices through careful co-design of model architecture with hardware constraints, maximizing hardware utilization as well as memory-bandwidth and FLOP efficiency. Moreover, Zyphra aims to enable the online, on-device personalization of AI models for specific use-cases and specific users, which we believe will unlock significant utility.
[1] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, “QLoRA: Efficient Finetuning of Quantized LLMs,” arXiv preprint arXiv:2305.14314v1 [cs.LG], May 2023 (https://arxiv.org/pdf/2305.14314)
[2] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge, “Zamba: A Compact 7B SSM Hybrid Model,” arXiv preprint arXiv:2405.16712v1 [cs.LG], May 2024 (https://arxiv.org/pdf/2405.16712)
[3] Opher Lieber, Barak Lenz, Hofit Bata, et al., “Jamba: A Hybrid Transformer-Mamba Language Model,” arXiv preprint arXiv:2403.19887v2 [cs.CL], Jul 2024 (https://arxiv.org/pdf/2403.19887)
[4] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” arXiv preprint arXiv:2205.14135v2 [cs.LG], Jun 2022 (https://arxiv.org/pdf/2205.14135)
[5] Tri Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” arXiv preprint arXiv:2307.08691v1 [cs.LG], Jul 2023 (https://arxiv.org/pdf/2307.08691)
[6] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision,” arXiv preprint arXiv:2407.08608v2 [cs.LG], Jul 2024 (https://arxiv.org/pdf/2407.08608)
[7] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang, “A Survey on Mixture of Experts,” arXiv preprint arXiv:2407.06204v2 [cs.LG], Aug 2024 (https://arxiv.org/pdf/2407.06204)
[8] Artyom Eliseev, Denis Mazur, “Fast Inference of Mixture-of-Experts Language Models with Offloading,” arXiv preprint arXiv:2312.17238v1 [cs.LG], Dec 2023 (https://arxiv.org/pdf/2312.17238)
[9] Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina, “MoE-Infinity: Offloading-Efficient MoE Model Serving,” arXiv preprint arXiv:2401.14361, Jan 2024
[10] Tim Dettmers, Luke Zettlemoyer, “The case for 4-bit precision: k-bit Inference Scaling Laws,” arXiv preprint arXiv:2212.09720 [cs.LG], Dec 2022
[11] https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
[13] https://developer.arm.com/Processors/Cortex-A78AE
[14] https://nanoreview.net/en/soc-compare/apple-a17-pro-vs-apple-a15-bionic