Bandwidth: a memory bandwidth benchmark

Revision 38
© 2012-2020 by Zack Smith. All rights reserved.


My program, called simply bandwidth, is an artificial benchmark primarily for measuring memory bandwidth on x86 and x86_64 based computers. It is useful for identifying weaknesses in a computer's memory subsystem, bus architecture, cache architecture, and in the processor itself.

bandwidth also tests some libc functions and, under GNU/Linux, it attempts to test framebuffer memory access speed if the framebuffer device is available.

This program is open source and covered by the GPL license. Although I wrote it mainly for my own benefit, I am also providing it pro bono, i.e. for the public good.

Change log

Release 1.9

More object-oriented improvements.

Release 1.8

More object-oriented improvements. Windows 64-bit support.

Release 1.7

Isolated Object-Oriented C library.

Release 1.6

Updated to use object-oriented C. Fixed Raspberry Pi support.

Release 1.5

Improved 256-bit routines. Added --nice switch.

Release 1.4

I added randomized 256-bit routines for 64-bit Intel CPUs.

Release 1.3

I added CSV output. I updated the ARM code for the Raspberry Pi 3 (AArch32).

Release 1.2

I put my old 32-bit ARM code back in for generic ARM systems.

Release 1.1

This release adds a second, larger font.

Release 1.0

This update separates out the graphing functionality. It also adds tests for the LODS[BWDQ] instructions. It is common knowledge that these instructions are slow and useless, but widely held beliefs are sometimes wrong, so I added a test that demonstrates just how dramatically slow the LODS instructions are.

Release 0.32

A little support for AVX.

Release 0.31

This release adds printing of cache information for Intel processors in 32-bit mode.

Release 0.30

This release adds printing of cache information for Intel processors in 64-bit mode.

Release 0.29

Further improved granularity with addition of 128-byte tests. Removed ARM support.

Release 0.28

Added proper feature checks using the CPUID instruction.

Release 0.27

Added 128-byte chunk size tests to x86 processors to improve granularity, especially around the 512-byte dip seen on Intel CPUs.

Release 0.26

AMD processors don't support SSE4 vector instructions, so I updated bandwidth to not utilize those when running on AMD-based computers.

Release 0.25

This update, released December 17th, 2010, extended the network bandwidth testing.

Release 0.24

Added a network bandwidth test.

Release 0.23

The latest bandwidth adds support for 64-bit Windows, so it now supports:

  • 32- and 64-bit GNU/Linux
  • 32- and 64-bit Windows
  • 64-bit Mac OS/X
  • Raspberry Pi ARM running 32-bit Raspbian
  • Generic ARM

And it already supported three processor architectures:

  • x86: performs 128- and 32-bit transfers
  • x86_64: performs 128- and 64-bit transfers
  • ARM: performs 32-bit transfers

Why write the core routines in assembly language?

For each architecture, I've implemented optimized core routines in assembly language.

Because the exact same core assembly-language routines run on all computers of a given architecture, it is like using the same ruler to measure multiple items.

This approach is crucial. If the core routines had been written in C or C++, the executed code would differ depending on the compiler version and compilation options, and the measurements could not be used for valid comparisons.

Results from Macbook Pro with 2.4 GHz Core i5 520M and 1066 MHz RAM

How fast is each type of storage on a typical system? This is the kind of thing students of Computer Architecture are asked on final exams.

For my Macbook Pro, the numbers are as follows.

  • Reading from the Crucial m4 SSD: 250 MB/second.
  • Reading from main memory (1066 MHz Crucial): maximum 7 GB/second = 28 times faster.
  • Reading from L3 cache: maximum 21 GB/second = 3 times faster than main memory or 86 times faster than SSD.
  • Reading from L2 cache: maximum 29.5 GB/second = 1.4 times faster than L3; 4.2 times faster than main memory; or 120 times faster than SSD.
  • Reading from L1 cache: maximum 44.5 GB/second = 1.5 times faster than L2; 2.1 times faster than L3; 6.4 times faster than main memory; or 178 times faster than SSD.

And the SSD is up to 4 times faster than the original hard disk drive.

Observations of running one instance of bandwidth

The table below presents program output from recent and former versions of bandwidth juxtaposed. They all use the same core routines. These numbers cover only sequential accesses.

The first interesting thing to notice is the difference in performance between 32-, 64-, and 128-bit transfers on the same processor. These differences show that if programmers took the trouble to revise software to use 64- or 128-bit transfers where appropriate, especially making them aligned to appropriate byte boundaries and sequential, great speed-ups could be achieved.

A second observation is the importance of having fast DRAM. The latest DRAM, when overclocked, can give stupendous results.

A third observation is the remarkable difference in speeds between memory types. In some cases the L1 cache is more than twice as fast as L2, and L1 is up to 9 times faster than main memory, whereas L2 is often 3 times faster than DRAM.

All L1, L2, and main memory figures are in MB/sec.

OS | Transfer size | Make/model | CPU | CPU speed | FSB speed | L1 read | L1 write | L2 read | L2 write | Main read | Main write | RAM type/speed
Intel GNU/Linux 64 | 128 bits | — | Intel Core i7-930 (overclocked) | 4.27 GHz | 2000 MHz | 64900 | 65100 | 43200 | 39900 | 18400 | 12800 | DDR3-2000
Mac OS/X Snow Leopard | 128 bits | Macbook Pro 15 2010 | Intel Core i5-520M | 2.4 GHz | 1066 MHz | 44500 | 44500 | 29600 | 27300 | 7100 | 5200 | PC3-8500
Intel GNU/Linux 64 | 128 bits | Lenovo Thinkpad T510 | Intel Core i5-540M | 2.53 GHz | 1066 MHz | 42000 | 42000 | 28500 | 26500 | 8000 | 3500 | PC3-8500
Mac OS/X Snow Leopard | 128 bits | Macbook Pro MC374LL/A | Intel Core 2 Duo P8600 | 2.4 GHz | 1066 MHz | 36500 | 34500 | 17000 | 14300 | 5620 | 5380* | PC3-8500
Intel GNU/Linux 64 | 128 bits | Thinkpad Edge 15 | Intel Core i3-330M | 2.13 GHz | 1066 MHz | 32110 | 32070 | 21380 | 19730 | 6390 | 2790 | DDR3-1066
Intel GNU/Linux 64 | 128 bits | Toshiba L505 | Intel T4300 | 2.1 GHz | 800 MHz | 31930 | 30190 | 15000 | 12500 | 4828 | 4036* | DDR2-800
Intel GNU/Linux 64 | 128 bits | Toshiba A135 | Intel Core 2 Duo T5200 | 1.6 GHz | 533 MHz | 24250 | 18970 | 9619 | 7237 | 2995 | 2299 | PC2-4200
Intel GNU/Linux 32 | 32 bits | Lenovo 3000 N200 | Celeron 550 | 2.0 GHz | 533 MHz | 7489 | 7125 | 6533 | 5007 | 2088 | 1290 | PC2-5300
Intel GNU/Linux 32 | 32 bits | Toshiba A205 | Pentium Dual T2390 | 1.86 GHz | 533 MHz | 7098 | 6734 | 7095 | 5675 | 2146 | 1255 | PC2-5300
Intel GNU/Linux 32 | 32 bits | Acer 5810TZ-4761 | Intel SU4100 | 1.3 GHz | 800 MHz | 4937 | 4682 | 4160 | 3013 | 1803 | 1682 | DDR3-1066
Intel GNU/Linux 32 | 32 bits | Dell XPS T700r | Pentium III | 700 MHz | 100 MHz | 2629 | 2284 | 2607 | 1630 | 448.5 | 163.7 | PC100
ARM GNU/Linux 32 | 32 bits | Sheevaplug | Marvell Kirkwood | 1.2 GHz | — | 3418 | 529.0 | 469.6 | 859.1 | 396.0 | 546.1 | DDR2
Windows Mobile | 32 bits | HTC Jade 100 | Marvell ARM | 624 MHz | — | 2165 | 483.7 | 130.7 | 434.5 | — | — | —
Intel GNU/Linux 32 | 32 bits | IBM Thinkpad 560E | Pentium MMX | 150 MHz | up to 66 MHz | 500.7 | 75.49 | 520.6 | 74.81 | 86.64 | 74.32 | EDO 60ns; 50 MHz

  * = Rate for writing while bypassing caches.

Note: Since I added graphing to bandwidth, I am no longer updating this table.

Running multiple instances of bandwidth simultaneously

Is bandwidth actually showing the maximum bandwidth to and from main memory? There is an easy way to test this. We can run one instance of bandwidth on each core of a multi-core CPU (in my case, two instances, one for each core) and add up the access bandwidths to/from main memory for all instances to see whether they approach the published limits for our main memory system.

On my Core i5 dual-core system, with DDR3 (PC3-8500) memory, the maximum RAM bandwidth ought to be 8500 MB/second.

Running on just one core:

  • Reading, it maxes out at 7050 MB/second from main memory.
  • Writing through the caches, it maxes out at 5120 MB/second to main memory.
  • Writing and bypassing the caches, it maxes out at 5520 MB/second to main memory.

When I've got two instances of bandwidth running at the same time, one on each core, the picture is a little different but not much.

  • Reading, the total bandwidth from main memory is 8000 MB/second, nearing the memory's maximum, or 14% faster than running just one instance of bandwidth.
  • Writing without bypassing the caches, the total bandwidth to main memory is 5650 MB/second, which is 10% faster than one instance.
  • Writing with the cache bypass, the total bandwidth to main memory is 6050 MB/second, which is 10% faster than one instance.

Thus, to really ascertain the upper performance limit of the main memory, it behooves the earnest benchmarker to run multiple instances of bandwidth and sum the results.

Graphs for memory bandwidth tests

Xeon E5-2630 v4 running at 2.2GHz, no TurboBoost, with 25MB smart cache and 128GB RAM Quad-Channel DDR4-2400, running GNU/Linux.

Intel Xeon E5-2690, rated at 2.9 GHz with a Turbo Boost speed of 3.8 GHz, with two sticks of DDR3 1600 MHz RAM (rated 12.8 GB/s per channel but not using dual-channel mode), running 64-bit GNU/Linux.

Intel Core i5-2520M, rated at 2.5 GHz but running at the Turbo Boost speed of 3.2 GHz, with two sticks of DDR3 1333 MHz RAM (rated 10.6 GB/s per channel), running 64-bit Ubuntu 10.

Intel Core i5-520M at 2.4 GHz with 3MB L2 cache, running Mac OS/X Snow Leopard, 64-bit routines:

Intel Core 2 Duo P8600 at 2.4 GHz with 3MB L2 cache, running Mac OS/X Snow Leopard, 64-bit routines:

Intel Core i5-540M at 2.53 to 3.07 GHz with 3MB L3 cache, running 64-bit GNU/Linux:

Intel Core i3-330M at 2.16 GHz with 3MB L3 cache, running 64-bit GNU/Linux:

Intel Pentium T4300 at 2.1 GHz with 1MB L2 cache, running 64-bit GNU/Linux:

Intel Core 2 Duo T5600 at 1.83 GHz, with 2MB L2 cache, in a Mac Mini:

Intel Core 2 Duo T5200 at 1.6 GHz, with 2MB L2 cache, running GNU/Linux 64-bit:

Intel Core 2 Duo E8400 at 3 GHz, with 6 MB L2 cache:

OMAP 3530 Cortex A8 in a Beagle Board running 32-bit GNU/Linux 2.6.29 at 720 MHz (turbo mode):



What changed

Fixed a bug affecting Linux.

Known issue

Support for AVX512 was not completed, and if your CPU offers that feature you might observe a crash. I don't have time to fix this, but if anyone would like to provide a patch I can make it available here.

Be nice

On GNU/Linux, I recommend using nice -n -2 when running bandwidth; otherwise the kernel may attempt to throttle the process.


Compiling

On Mac OS/X, you will certainly need to upgrade to the latest NASM. Then compile with make bandwidth64. Note that the latest MacOS does not allow 32-bit compilation.

On GNU/Linux, you need a copy of NASM and the GCC suite. A decent distro will supply these. Simply type make bandwidth32 or make bandwidth64 to produce the Intel executables.

On Windows:

  • 64-bit: To compile on 64-bit Windows type make bandwidth64.
  • 32-bit: Because Cygwin is no longer available for 32-bit Windows, you may have to use MinGW. To compile for 32-bit Windows, type make bandwidth32.


Intel's Max Memory Bandwidth number

When Intel says you can achieve a Max Memory Bandwidth of e.g. 68 GB/sec from your 18-core processor, what they mean is the upper combined limit for all cores. To test this, you can run multiple copies of my bandwidth utility simultaneously, then add up the bandwidth values from each core accessing main memory. Each individual core may achieve quite a bit less bandwidth going to main memory. That's OK.

This larger number may at first seem like a marketing gimmick from Intel, but it's a good number to know: when your system is extremely busy, this is the upper limit that constrains all the cores' combined activity. What Intel should also do is give the per-core maximum alongside the collective maximum.

The impact of an L4 cache

Level 4 caches are ostensibly for improving graphics performance, the idea being that the GPU shares the cache with the CPU. But does it impact CPU performance?

A bandwidth user, Michael V., provided a graph showing that it does for the Intel Core i7-4750HQ. The 128 MB L4 cache appears to be roughly twice as fast as main memory.

ARM support

I have reinstated ARM support, but mainly for the Raspberry Pi 3. An earlier release of bandwidth supported 32-bit ARM CPUs found in Windows Mobile phones and iOS devices. There is a lot of variability among ARM CPUs in terms of which instructions are supported, so I don't plan to expand ARM support much beyond the Rpi series.

Sequential versus random memory access

Modern processor technology is optimized for predictable memory access patterns, and sequential accesses are exactly that. As the graphs above show, out-of-order (randomized) accesses disrupt the cache contents, resulting in lower bandwidth. Such results are closer to real-world performance, albeit only for memory-intensive programs.

Generalizations about memory and register performance

One has certain expectations about the performance of different memory subsystems in a computer. My program confirms these.

  • Reading is usually faster than writing.
  • L1 cache accesses are significantly faster than L2 accesses e.g. by a factor of 2.
  • L1 cache accesses are much faster than main memory accesses e.g. by a factor of 5 or more.
  • L2 cache accesses are faster than main memory accesses e.g. by a factor of 3 or more.
  • L2 cache writing is usually significantly slower than L2 reading. This is because existing data in the cache has to be flushed out to main memory before it can be replaced.
  • If the L2 cache is in write-through mode then L2 writing will be very slow and more on par with main memory write speeds.
  • Main memory is slower to write than to read. This is just the nature of DRAM. It takes time to charge or discharge the capacitor that is in each DRAM memory cell whereas reading it is much faster.
  • Framebuffer accesses are usually much slower than main memory.
  • However framebuffer writing is usually faster than framebuffer reading.
  • C library memcpy and memset are often pretty slow; perhaps this is due to unaligned loads and stores and/or insufficient optimization.
  • Register-to-register transfers are the fastest possible transfers.
  • Register-to/from-stack are often half as fast as register-to-register transfers.

A historical addendum

One factor that reduces a computer's bandwidth is a write-through cache, be it L2 or L1. These were used in early Pentium-based computers but were quickly replaced with more efficient write-back caches.

SSE4 vector-to/from-register transfers

While transfers between the main registers and the XMM vector registers using the MOVD and MOVQ instructions perform well, transfers involving the PINSR* and PEXTR* instructions are slower than expected. In general, to move a 64-bit value into or out of an XMM register using MOVQ is twice as fast as using PINSRQ or PEXTRQ, suggesting a lack of optimization on Intel's part of the latter instructions.

What about ganged mode?

Let's say your motherboard supports dual-channel RAM operation. This means that your two DIMMs are managed together, providing the CPU with what appears to be a single 128-bit wide memory device.

Whether you are using dual-channel mode depends not only on your motherboard and chipset, but also on whether your BIOS is configured for it.

The default BIOS setting for this, referred to as the DCT or DCTs feature, is often unganged i.e. the two memory sticks are not acting together.

What is a DCT? It is a DRAM controller. Because in unganged mode each channel is independent, there must be one DCT per channel. A motherboard and chipset supporting a wide path to your RAM will likely provide as many DCTs as there are channels, as needed for unganged mode.

In the BIOS settings you will either see a simple selection for ganged versus unganged mode, or it may refer to which actual DRAM controllers are assigned to which channels, e.g. DCT0 (the first DRAM controller) manages channel A and DCT1 (the second) manages channel B.

If your computer doesn't have an old-style PC BIOS but instead uses UEFI, as Apple devices do, you may not have the option to alter the ganged/unganged setting. Consumers are thus disempowered, firstly in that the settings may not be accessible, and secondly in that the details of how UEFI works are proprietary and subject to an NDA. Therefore: UEFI is bad for consumers.

Q: Does ganged mode actually improve speed?
A: People say it generally neither improves nor reduces performance, which is why it is not enabled by default. Ganged and unganged offer about the same performance for any given application running on one core.

Q: If unganged mode requires more silicon (one DCT per channel) but has the same performance as ganged mode, then why not enable ganged by default and remove the extra DRAM controllers?
A: Because unganged mode offers more flexibility in letting multiple cores and hyperthreads access different areas of memory at once.

Q: How can maximum performance be achieved realistically?
A: With the help of the OS, a program could allocate its in-memory data sets across the DIMMs (obviously this is a virtual memory issue) to avoid the bottleneck of all of its data going through just one channel.

Why is L1 cache read speed so amazingly fast on XYZ CPU?

One user showed me a bandwidth graph from the i5-2520M where for some reason, loading 128-bit values from the L1 cache sequentially into the XMM registers was running at an astounding 96 GB per second. Writing into L1 was much slower.

After a few calculations, it became clear why this was happening:

  1. 96 GB/second at the processor's peak speed of 3.2 GHz means a transfer rate of roughly 32 bytes, i.e. 256 bits, per cycle. But the XMM registers are only 128 bits wide...
  2. Newer Intel CPUs do have 256-bit YMM registers, used by the Advanced Vector Extensions (AVX) instructions. Each YMM register extends one of the 16 existing XMM registers by an additional 128 bits, for a combined 256 bits per register.
  3. However, notice that my test is not loading YMM registers; it is loading XMM registers. Evidently, in each cycle 128 bits are being transferred to each of two separate registers, which in a straightforward hardware design would share the same input wires.

Therefore, Intel has designed the circuitry so that two XMM registers can be loaded in one cycle. Either the L1-to-XMM/YMM path has been expanded to 256 bits, or reading from L1 can occur twice per cycle (dual data rate). The latter seems more likely.

My Xeon has a 20 MB shared L3. Will it be fast?

A shared L3 means that if you have X cores and each is running a program that is very memory-intensive, like bandwidth or for instance the genetics program BLAST, the amount of L3 that each core can effectively use will be its fraction of the total. If Y is the number of megabytes of L3, this would be Y/X.

This was demonstrated to me recently when a person with a dual-CPU Xeon E5-2690 system (20 MB of L3, 8 cores, and 4 memory channels per CPU) ran bandwidth on 8 cores out of 16, resulting in each core effectively having only 5 MB of L3. Had he been running bandwidth on all 16 cores, each would effectively have had only 2.5 MB of L3 to use.

If one were to use an AMD Opteron with 10 cores and 30 MB of shared L3, the worst case situation would be each core having effectively 3 MB of L3, which is the same as a lowly consumer-grade Core i5.

Thus, with a Xeon or Opteron running a memory-intensive application on each core, perhaps one's priorities should be:

  • Using the fastest possible RAM.
  • Choosing a CPU with a large per-core L2 cache (the Opteron's is 512 kB while the Xeon's is 256 kB).
  • Organizing the data set as best as possible in shared memory.

Why is the 128-bit performance of the Turion so horribly bad?

It has been reported to me that AMD's Turion performs very poorly when running SSE instructions. I haven't had a chance to verify this myself, but note that the Turion is AMD's lowest-grade consumer product, so it's no surprise that it's hobbled. Don't buy a Turion.