Introduction

Computational ML models can be represented as a graph, where the nodes denote operations and the edges denote the amount of information exchanged between connected nodes. These nodes (or groups of nodes, as illustrated in Figure 1) can then be mapped to processors in an array – for instance individual tensor cores in a GPU – based on the operations they require, the memory they need, and their connections to other nodes, with careful consideration of their proximity to connected neighbors in the graph.

The process of assigning a specific processor in an array to a node in the graph is known as "mapping". Information exchange between processors within the die is managed by an on-chip interconnect fabric. The process of programming this interconnect fabric is referred to as “routing”. Developing mappers and router compilers for these interconnect fabrics is usually a complex and challenging task since they must balance the spatial topology of processors on the chip and the bandwidth requirements of particular programs to ensure the processors can be fully utilized.

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

As fabrication processes advance, enabling more transistors to be packed into a given silicon area, it becomes possible to integrate more processing units on these dies. This increase in processing units leads to significantly higher on-chip memory bandwidth, as demonstrated by chips like Tesla’s Dojo, Cerebras WSE-3, and Groq's LPU. However, as the number of processors increases, the need for an efficient interconnect strategy becomes crucial—one that can sustain bandwidth levels comparable to those between a processor and its local memory.

Commonly the interconnect fabric of choice are networks on chip (NoCs), which traditionally introduce significant overheads in silicon area, primarily due to the memory required to temporarily store data in NoC switches or to hold packets until the corresponding processor is ready to process them. In networks that enable point-to-point connections between any two processors, these memories are implemented on-chip as SRAM blocks. This results in significant silicon overhead when multiple blocks are needed for each NoC switch, often leading to a large portion of the silicon area being dedicated to the on-chip NoC.

Our proposed NeuraNoC overcomes the challenges inherent in traditional NoC designs. It is the first NoC system to undergo a "training" process at compile-time tailored to meet the bandwidth demands of specific workloads, simplifying the place-and-route process. We regard the packets routed in the network as 'carriers', whose behavior is entirely independent of the specific state of any processors exchanging information with the NoC. These carriers define the bandwidth between connected nodes and operate in a completely deterministic manner, similar to time slots in time domain multiplexing within a token ring network. This determinism enables processors to precisely know when they can send and receive data, down to the exact clock cycle. Processors knowing specifically the clock cycle where they will either receive or can send data, allows to further remove the inbound and outbound buffers usually present in networks on chip. Moreover, NeuraNoC eliminates all buffers typically found in switches, allowing for the complete removal of the memory usually required in conventional NoC approaches, allowing to achieve a very low silicon area footprint, while still maintaining homogeneous latency and bandwidth over time.

Thanks to ML workloads featuring predictable and generally static bandwidth profiles for long periods of time, they are particularly suitable for the NeuraNoC, as those bandwidth profiles can be provided at compile time. We have developed an algorithm that takes as an input these bandwidth profiles and produces a highly efficient configuration for the NoC to satisfy those needed internode bandwidths.

How can we report NoC performance results?

Zyphra's NeuraNoC, is a highly adaptable multi-processor array fabric that operates efficiently with minimal prerequisites and no dependence on a specific network topology. Designed to avoid deadlocks and livelocks, the NeuraNoC ensures packet delivery within a finite number of cycles without packet losses. Differing from other NoCs that utilize local buffers to manage traffic, the NeuraNoC operates without buffers or FIFO memories, thereby minimizing the silicon footprint in multi-processor arrays. This predictable and deterministic behavior enables processing units (PUs) to precisely time the injection and reception of packets without depending on queueing buffers.

In conventional networks on-chip, a processor connected to a network switch consumes data at a specific input rate R_in, while the data may arrive from the NoC at a different rate R_NoC. Typically, each network switch is built with a small amount of local SRAM memory to which packets destined for a node must be written. Within a network node, PUs may use the local memory for various purposes, causing incoming packets from the NoC to wait for available bandwidth to be written locally, resulting in fluctuations in input rate R_in and leading to large buffers being required to ensure packets are not dropped.

Both R_in and R_NoC can be viewed as random variables, with instantaneous rates that can vary over time but with average rates that must be equal E(R_in(t) = E(R_NoC(t))). Similarly, the rate at which a PU wants to inject packets into the NoC (rate R_out) can differ from the rate at which the NoC can send packets to other PUs (rate R_NoC). As R_out, R_NoC, and R_in can fluctuate over time, intermediate buffers are necessary to alleviate any instantaneous rate changes. Generally, these buffers are implemented as the inbound and outbound network buffers found in traditional network nodes as seen in figure 2.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

The predictability of traffic found in the NeuraNoC eliminates the need for the intermediate buffers used in equalizing data rates. Furthermore, no other buffers – such as the buffers required to further equalize traffic from different interfaces in a node switch used in traditional NoC designs – will be needed in our NeuraNoC design. With a hardened NoC configuration driven by workloads of interest, our NoC compiler allows us to specify the needed bandwidth between any logically connected nodes at compile time.

The NeuraNoC only features registers in the buses connecting neighboring nodes, as seen in Figure 3. Even though this example shows the case of a 4x4 mesh, the proposed network works for any network topology chosen, as long as a series of simple requirements are satisfied. Generally, when considering the “place and route” (P&R) backend process of a NoC, the number of pipelining stages between nodes is chosen based on the target clock frequency, and any addition of pipelining stages will only serve this purpose. On the other hand, our proposed design cleverly utilizes the registers between nodes as the traffic equalizing buffers. Therefore, the buffers that traditional NoC designs place inside every network node, as shown in figure 2, can now be embedded into the edges of the network itself. This design allows all the nodes to share these memory elements more efficiently.

With this in mind, the proposed NeuraNoC network can dramatically reduce silicon resources compared to other NoCs since it does not require any memory blocks of any kind. The addition of pipelining stages will now not only help in achieving the desired clock frequency, but will also increase the overall embedded NoC buffer size, thereby both increasing the network’s degrees of freedom and allowing the system to more accurately achieve the desired bandwidths between logically-connected nodes.

Introduction

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Overview

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

A connection from node B to node A both reads data from A and writes into B, and reads from node B and writes into A. NDGNoC is useful for simulating non-directed computational graphs such as dynamical systems, non-directed ML graphs, some energy-based training algorithms, ML training, etc.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

NoC performance results need to be presented based on particular workloads of interest. Precise and specific workloads are difficult to come up with and to test rigorously, because they require graph pre-compilers, and proper hardware graph mappers. This is why it is very difficult to find actual performance results for NoC designs, and many of those found are very questionable. Cerebras claims numbers like “total fabric bandwidth to 220 petabits per second”, or Graphcore claims “47.5TB/s memory bandwidth per IPU”. Those numbers do not mean much if those are not bandwidths that are actually sustained over time for an actual meaningful workload. How can we provide generic performance results that are actually meaningful?

To address this, we designed a benchmark where we measure the ability of the network to exploit data locality, and for that we will define a receptive field for every node in the NoC. The receptive fields are defined by a target Hamming Distance. The workloads we will test will consider all nodes in the NoC to be connected to nodes that are within a given Hamming Distance (HD). A HD of 1 will make nodes be connected only to their nearest neighbors, and a very high HD will make the NoC behave closer to a crossbar.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

How can we report NoC performance results?

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

Introduction

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

Overview

How can we report NoC performance results?

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

Figure 5. Left is showing the case of a directed graph, and the right one undirected. We make every node be connected to nodes within a given hamming distance.

Assuming a toroid topology, and having each node connected to its neighbors within a specified Hamming distance, we estimated the energy dissipation, bandwidth, and precise PPA outcomes for TSMC 5nm process. Utilizing our network compiler and a cycle-accurate emulator, we confirmed that PPA remains constant for a given network connectivity density, regardless of the number of nodes present in the NoC. This benchmark we developed enables quick performance evaluation based on the local connection density in a computational graph, eliminating the need to execute the network compiler for each workload.

In this video, we showcase a 10x10 torus NoC setup with 256 randomly established connections featuring varied bandwidths. The arrows linking nodes illustrate the pipelining stages set between the network's processing units. In the upper right, the x-axis displays the various connections defined, with the target bandwidths for these connections highlighted in dark blue. As the network simulation progresses, the empirical bandwidths achieved, shown in light blue, will be observed converging to or exceeding these target values. In the bottom right corner, we detail the latencies experienced for each connection.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

Figure 5. Left is showing the case of a directed graph, and the right one undirected. We make every node be connected to nodes within a given hamming distance.

Left: 10x10 torus NoC configuration with 256 randomly set connections of different bandwidths. Arrows between nodes depict the pipelining stages among the network's processing units.

Right Top: The x-axis shows various connections with target bandwidths in dark blue. As the simulation progresses, actual bandwidths, displayed in light blue, converge to or exceed these targets.

Right Bottom: Latencies experienced for each connection.

Link to Cookbook (GitHub)

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

What is Annealing?

Figure 5. Left is showing the case of a directed graph, and the right one undirected. We make every node be connected to nodes within a given hamming distance.

Left: 10x10 torus NoC configuration with 256 randomly set connections of different bandwidths. Arrows between nodes depict the pipelining stages among the network's processing units.

Right Top: The x-axis shows various connections with target bandwidths in dark blue. As the simulation progresses, actual bandwidths, displayed in light blue, converge to or exceed these targets.

Right Bottom: Latencies experienced for each connection.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Introduction

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

Overview

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

How can we report NoC performance results?

Figure 5. Left is showing the case of a directed graph, and the right one undirected. We make every node be connected to nodes within a given hamming distance.

Left: 10x10 torus NoC configuration with 256 randomly set connections of different bandwidths. Arrows between nodes depict the pipelining stages among the network's processing units.

Right Top: The x-axis shows various connections with target bandwidths in dark blue. As the simulation progresses, actual bandwidths, displayed in light blue, converge to or exceed these targets.

Right Bottom: Latencies experienced for each connection.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

How can we report NoC performance results?

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

Figure 5. Left is showing the case of a directed graph, and the right one undirected. We make every node be connected to nodes within a given hamming distance.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

Table 1: Evaluation scores for Zyda-2 vs alternative datasets broken down more granularly by specific evaluation metric

Figure 5. Left is showing the case of a directed graph, and the right one undirected. We make every node be connected to nodes within a given hamming distance.

Left: 10x10 torus NoC configuration with 256 randomly set connections of different bandwidths. Arrows between nodes depict the pipelining stages among the network's processing units.

Right Top: The x-axis shows various connections with target bandwidths in dark blue. As the simulation progresses, actual bandwidths, displayed in light blue, converge to or exceed these targets.

Right Bottom: Latencies experienced for each connection.

Analysis of Global Duplicates

We present histograms depicting distribution of cluster sizes in all the datasets (see Fig. 7-11). Please, note that all the figures are in log-log scale. We see a significant drop in the number of clusters starting from the size of around 100. This drop is present both in DCLM and FineWeb-Edu2 (see Fig. 8 and 9 respectively), and most likely is explained by a combination of the deduplication strategy and quality when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low quality material (repeated advertisements, license agreements templates, etc), so it’s not surprising that such documents were removed. Notably, DCLM still contained one cluster with the size close to 1 million documents, containing low quality documents seemingly coming from the advertisements (see Appendix).We find both Zyda-1and Dolma-CC contain a small amount of duplicates, which is expected, since both datasets were deduplicated globally by their authors. Remaining duplicates are likely false negatives from the initial deduplication procedure. Note, that distribution of duplicates clusters sizes of these two datasets (Fig. 10 and 11) don’t contain any sharp drops, but rather hyper exponentially decreases with cluster size.

Figure 7: Distribution of cluster sizes of duplicates in global dataset (log-log scale).

Figure 8: Distribution of cluster sizes of duplicates in DCLM (log-log scale).

Figure 9: Distribution of cluster sizes of duplicates in FineWeb-Edu2 (log-log scale).

Figure 10: Distribution of cluster sizes of duplicates in Zyda-1 (log-log scale).

Figure 11: Distribution of cluster sizes of duplicates in Dolma-CC (log-log scale).

Largest cluster in DCLM

Below is an example of the document from the largest cluster (~1M documents) of duplicates in DCLM (quality score 0.482627):
‍
‍Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is SafeSafe score: 1
‍‍
‍The higher the number, the more dangerous the website.Any number higher than 1 means DANGER.
‍‍
‍Positive votes:
Negative votes:
Vote Up Vote Down review
‍‍
‍Have you had bad experience with Warn us, please!

Examples of varying quality score in a cluster of duplicates in DCLM

Below one will find a few documents with different quality scores from DCLM coming from the same duplicates cluster. Quality score varies from ~0.2 to ~0.04.

Document ID: <urn:uuid:941f22c0-760e-4596-84fa-0b21eb92b8c4>

Quality score of: 0.19616

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally categorized as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jockey artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which clearly lies beyond the boundaries of each. Perhaps best described as simply "experimental," Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dicks (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running-styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Dicks left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize. ~ Sean Cooper, Rovi

Document ID: <urn:uuid:0df10da5-58b8-44d8-afcb-66aa73d1518b>

Quality score of: 0.091928

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally grouped in as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of the music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jocky artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which lay clearly beyond the boundaries of each. Perhaps best described as simply experimental, Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dick (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Elliot Dick left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize.
Sean Cooper, Rovi
‍
More Rome
‍
You may also like...

Document ID: <urn:uuid:4986ef09-3ee3-4e13-9084-7898aaf72aaf>

Quality score of: 0.072259

recent on-air advertisers

Now Playing

You Control the ...

Artist Snapshot:

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally grouped in as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of the music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jocky artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which lay clearly beyond the boundaries of each. Perhaps best described as simply experimental, Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dick (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Elliot Dick left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize. ~ Sean Cooper, RoviSean Cooper, Rovi
‍
More Rome
‍
You may also like...

Document ID: <urn:uuid:1e0496a9-0116-418a-9aec-e65b1d20e709>

Quality score of: 0.0424

18 June 2015

ROME self titled 1996

by request

Artist Biography by

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally categorized as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jockey artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which clearly lies beyond the boundaries of each. Perhaps best described as simply "experimental," Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dicks (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running-styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Dicks left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize.
‍
1 Leaving Perdition 8:10
2 Intermodal 3:39
3 Lunar White 3:25
4 She's A Black Belt 3:14
5 Rohm 1:09
6 Radiolucence (Version) 5:31
7 Deepest Laws 14:14

No comments:

Introduction

Reported scores underlined.

Pass@1 scores with greedy sampling.

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

Pass@1 scores with greedy sampling. Livebench 2024-11-25.
Bold: Best score at 1.5B scale w/ greedy sampling
*reported scores

Evals (reported underlined). All numbers pass@1 estimated using n=16

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Overview

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

How can we report NoC performance results?

Figure 5. Left is showing the case of a directed graph, and the right one undirected. We make every node be connected to nodes within a given hamming distance.

Footnote: Training on the Eurus-2-RL dataset did not match the DeepScaleR math evaluation numbers, possibly due to lower quality synthetic math questions in NuminaMath-CoT providing a mixed training signal, or the solvability filtering process with QwQ-preview reducing the difficulty of the dataset. Additionally, the relatively small percentage of code (5%) likely led to math dominating training at the expense of code performance. Training on domain specific datasets and merging resulting models seems to be a potential way to counteract this problem, as demonstrated with SFT in Light-R1.

Left: 10x10 torus NoC configuration with 256 randomly set connections of different bandwidths. Arrows between nodes depict the pipelining stages among the network's processing units.

Right Top: The x-axis shows various connections with target bandwidths in dark blue. As the simulation progresses, actual bandwidths, displayed in light blue, converge to or exceed these targets.

Right Bottom: Latencies experienced for each connection.

Introduction

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

Overview

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

How can we report NoC performance results?

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

Figure 5. Left is showing the case of a directed graph, and the right one undirected. We make every node be connected to nodes within a given hamming distance.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)

Figure 4. Two nodes connected with directed (left) and non-directed (right) NeuraNoC nodes.

Introduction

Overview

Figure 1. Example of a computation graph that can be collapsed into different clusters of nodes that are used to map on multi-core arrays.

How can we report NoC performance results?

Prompt #1

I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #2

The emperor's complexion did not change, remaining as still as a sculpture, and a touch of touching warmth flashed in his eyes. He deeply glanced at the loyal minister, and finally spoke: "Well, I will consider it again." His voice was low and firm, leaving a faint hint of helplessness and tenderness in the air.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #3

You don't even think to call me "Godfather." You come into my house on the day my daughter is to be married and you ask me to do murder - for money.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #4

Brave bakers boldly baked big batches of brownies in beautiful bakeries.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #5

Active artists always appreciate artistic achievements and applaud awesome artworks.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #6

I was, like, talking to my friend, and she’s all, um, excited about her, uh, trip to Europe, and I’m just, like, so jealous, right?

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #7

F one F two F four F eight H sixteen H thirty two H sixty four

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #8

Its chlorover. Like totally chlorover. Totally. Completely. Chlorover.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #9

Crafting a symphony of flavors the skilled chef orchestrated a culinary masterpiece that left an indelible mark mark mark mark mark on the palates of the discerning diners.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Figure 2. Example of a more traditional NoC, where buffers are used to smooth the effect of instantaneous data rate changes. Traffic flows from one network node to another one. The rate node 1 outputs data will be distributed as R_out ∼ R_out(μ,σout), and the rate at which node 2 can receive data will be R_in ∼ R_in (μ, σin ). Due to the NoC not having all its time slots necessarily free, the rate at which packets can be taken from PU 1 will be _RNoC ∼ R_NoC(μ,σNoC).

Figure 3. This is an example of a 4 x 4 2D mesh. The proposed network will only feature registers in the buses connecting neighboring nodes (drawn in light blue).

Two different networks have been developed, with or without directional edges, showcased in figure 4.

Directed Graph NoC (DGNoC)

A connection from node B to node A reads data from B and writes into A. DGNoC is useful in simulating directed graphs such as data pipeline systems, inference calculation, etc.

Non-Directed (or bidirectional) Graph NoC (NDGNoC)