zsmith.co
Bandwidth: a memory bandwidth benchmark
for x86 / x86_64 based Linux/Windows/MacOSX
Contact:
1 at zsmith dot co



Introduction

My program, called simply bandwidth, is an artificial benchmark primarily for measuring memory bandwidth on x86 and x86_64 based computers, useful for identifying weaknesses in a computer's memory subsystem, in the bus architecture, in the cache architecture and in the processor itself.

Despite the focus on memory testing, in release 0.24 I also added network bandwidth testing. The results are graphed.

bandwidth also tests some libc functions and, under Linux, it attempts to test framebuffer memory access speed if the framebuffer device is available.

This program is open source and covered by the GPL license. Although I wrote it mainly for my own benefit, I am also providing it pro bono, i.e. for the public good.

Change log

Release 1.1

This release adds a second font.

Release 1.0

This update separates out the graphing functionality and adds support for the LODS[BWDQ] instructions.

Release 0.32

My port of bandwidth to iOS & the ARM processor is the app iBenchmark, which includes several other performance tests such as integer and floating point math.
In the App Store: iBenchmark
A little support for AVX.

Release 0.31

This release adds printing of cache information for Intel processors in 32-bit mode.

Release 0.30

This release adds printing of cache information for Intel processors in 64-bit mode.

Release 0.29

Further improved granularity with addition of 128-byte tests. Removed ARM support.

Release 0.28

Added proper feature checks using the CPUID instruction.

Release 0.27

Added 128-byte chunk size tests to x86 processors to improve granularity, especially around the 512-byte dip seen on Intel CPUs.

Release 0.26

AMD processors don't support SSE4 vector instructions, so I updated bandwidth to not utilize those when running on AMD-based computers.

Release 0.25

This update, released December 17th, 2010, extended the network bandwidth testing as follows:
  1. Test is now bidirectional.
  2. Port# is specifiable.

Release 0.24

The new network bandwidth test is at present simple: It sends chunks of data of varying sizes to nodes, which respond when the entirety of the chunk has been read. The time from start of send to receipt of response is used to calculate the bandwidth.
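
The timing scheme described above can be sketched as follows. This is only an illustration in Python, not bandwidth's actual code: it uses a local socket pair and a thread in place of two machines, and the names transponder and measure are made up for the example.

```python
import socket
import threading
import time

def transponder(conn, chunk_size):
    """Read exactly chunk_size bytes, then send a one-byte response."""
    remaining = chunk_size
    while remaining > 0:
        data = conn.recv(min(65536, remaining))
        if not data:
            return  # peer closed the connection early
        remaining -= len(data)
    conn.sendall(b"\x01")

def measure(chunk_size):
    """Return MB/second for one chunk sent over a local socket pair."""
    leader, responder = socket.socketpair()
    worker = threading.Thread(target=transponder, args=(responder, chunk_size))
    worker.start()
    payload = bytes(chunk_size)
    start = time.perf_counter()
    leader.sendall(payload)   # start of send
    leader.recv(1)            # receipt of response
    elapsed = time.perf_counter() - start
    worker.join()
    leader.close()
    responder.close()
    return chunk_size / elapsed / 1e6

# Chunk sizes here are arbitrary example values.
for size in (32 * 1024, 256 * 1024, 1024 * 1024):
    print(f"{size:>8} bytes: {measure(size):10.1f} MB/second")
```

Because the clock stops only when the response arrives, the measured time includes the return trip, so small chunks will tend to show lower bandwidth than large ones.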

The node that runs the test is the leader and the others are the transponders.

The network test cannot be combined with the memory bandwidth tests.

Just to clarify, you need two computers to run the test:

  • Machine A: The computer that is running (leading) the test.
  • Machine B: The computer that is the transponder.
  • Sent data is from A to B.
  • Received data is from B to A.

Release 0.23

This latest bandwidth adds support for Mac OS/X, bringing the number of supported operating systems to four:
  • Linux
  • Mac OS/X
  • 32-bit Windows
  • Windows Mobile ARM
And it already supported three processor architectures:
  • x86: performs 128- and 32-bit transfers
  • x86_64: performs 128- and 64-bit transfers

For each architecture, I've implemented optimized core routines in assembly.

Bandwidth 0.23 builds on the novelty of 0.22's register-to-register transfer speeds by including transfers to, from, and between vector registers (XMM). It also adds a test of memory copy speeds.

As of revision 0.21, bandwidth performs both sequential and random reading and writing of a range of progressively larger chunks of memory, which lets you effectively test several types of memory:

  • Level 1 cache
  • Level 2 cache
  • Main memory
Revision 0.21 also added randomized memory access, which shows how the system might perform in a real-world situation when running a memory-intensive program.

Revision 0.20 added a novel and helpful improvement: Graphing. Using my bmplib, the program generates a 1280x720-pixel graph depicting the results.

Revision 0.19 added another novel improvement: When it performs 128-bit writes using SSE2 instructions it does it in two ways:

  1. It writes into the caches with MOVDQA.
  2. It writes bypassing the caches with MOVNTDQ.

Results from Macbook Pro with 2.4 GHz Core i5 520M and 1066 MHz RAM

How fast is each type of storage on a typical system? This is the kind of thing students of Computer Architecture are asked on final exams.

For my Macbook Pro, the numbers are as follows.

  • Reading from the Crucial m4 SSD: 250 MB/second.
  • Reading from main memory (1066 MHz Crucial): maximum 7 GB/second = 28 times faster.
  • Reading from L3 cache: maximum 21 GB/second = 3 times faster than main memory or 86 times faster than SSD.
  • Reading from L2 cache: maximum 29.5 GB/second = 1.4 times faster than L3; 4.2 times faster than main memory; or 120 times faster than SSD.
  • Reading from L1 cache: maximum 44.5 GB/second = 1.5 times faster than L2; 2.1 times faster than L3; 6.4 times faster than main memory; or 178 times faster than SSD.
And the SSD is up to 4 times faster than the original hard disk drive.
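
The ratios above can be recomputed with a few lines of Python. This is a sketch using the rounded figures from the list and assuming 1 GB = 1000 MB; if 1 GB is taken as 1024 MB instead, some ratios shift by a few percent, which may account for small differences from the quoted values. The names read_speed and speedup are made up for the example.

```python
# Read speeds in MB/second, from the list above (1 GB treated as 1000 MB).
read_speed = {
    "SSD": 250,
    "main memory": 7000,
    "L3 cache": 21000,
    "L2 cache": 29500,
    "L1 cache": 44500,
}

def speedup(fast, slow):
    """How many times faster `fast` is than `slow`."""
    return read_speed[fast] / read_speed[slow]

# For each level, compare against every slower level, nearest first.
levels = list(read_speed)
for i, fast in enumerate(levels[1:], start=1):
    ratios = ", ".join(
        f"{speedup(fast, slow):.1f}x {slow}" for slow in reversed(levels[:i])
    )
    print(f"{fast}: {ratios}")
```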

Results from iOS

My port of bandwidth to the ARM is included in my app iBenchmark. Here is a sample screenshot from the iPhone 3GS:

Observations of running one instance of bandwidth

The table below presents program output from recent and former versions of bandwidth juxtaposed. They all use the same core routines. These numbers cover only sequential accesses.

The first interesting thing to notice is the difference in performance between 32-, 64-, and 128-bit transfers on the same processor. These differences show that if programmers were to go to the trouble of revising software to use 64- or 128-bit transfers where appropriate, and especially of making those transfers sequential and aligned to appropriate byte boundaries, great speed-ups could be achieved.

A second observation is the importance of having fast DRAM. The latest DRAM, overclocked, can give stupendous results: the overclocked DDR3-2000 system in the table reads main memory at 18400 MB/second.

A third observation is the remarkable difference in speeds between memory types. In some cases the L1 cache is more than twice as fast as L2, and L1 is up to 9 times faster than main memory, whereas L2 is often 3 times faster than DRAM.

OS | Transfer size | PC make/model | CPU | CPU speed | Front-side bus speed | L1 read (MB/sec) | L1 write (MB/sec) | L2 read (MB/sec) | L2 write (MB/sec) | Main read (MB/sec) | Main write (MB/sec) | Main memory RAM type/speed
Intel Linux 64 | 128 bits | | Intel Core i7-930 | Overclock 4.27 GHz | 2000 MHz | 64900 | 65100 | 43200 | 39900 | 18400 | 12800 | DDR3-2000
Mac OS/X Snow Leopard | 128 bits | Macbook Pro 15 2010 | Intel Core i5-520M | 2.4 GHz | 1066 MHz | 44500 | 44500 | 29600 | 27300 | 7100 | 5200 | PC3-8500
Intel Linux 64 | 128 bits | Lenovo Thinkpad T510 | Intel Core i5-540M | 2.53 GHz | 1066 MHz | 42000 | 42000 | 28500 | 26500 | 8000 | 3500 | PC3-8500
Mac OS/X Snow Leopard | 128 bits | Macbook Pro MC374LL/A | Intel Core 2 Duo P8600 | 2.4 GHz | 1066 MHz | 36500 | 34500 | 17000 | 14300 | 5620 | 5380* | PC3-8500
Intel Linux 64 | 128 bits | Thinkpad Edge 15 | Intel Core i3-330M | 2.13 GHz | 1066 MHz | 32110 | 32070 | 21380 | 19730 | 6390 | 2790 | DDR3-1066
Intel Linux 64 | 128 bits | Toshiba L505 | Intel T4300 | 2.1 GHz | 800 MHz | 31930 | 30190 | 15000 | 12500 | 4828 | 4036* | DDR2-800
Intel Linux 64 | 128 bits | Toshiba A135 | Intel Core 2 Duo T5200 | 1.6 GHz | 533 MHz | 24250 | 18970 | 9619 | 7237 | 2995 | 2299 | PC2-4200
Intel Linux 32 | 32 bits | Lenovo 3000 N200 | Celeron 550 | 2.0 GHz | 533 MHz | 7489 | 7125 | 6533 | 5007 | 2088 | 1290 | PC2-5300
Intel Linux 32 | 32 bits | Toshiba A205 | Pentium Dual T2390 | 1.86 GHz | 533 MHz | 7098 | 6734 | 7095 | 5675 | 2146 | 1255 | PC2-5300
Intel Linux 32 | 32 bits | Acer 5810TZ-4761 | Intel SU4100 | 1.3 GHz | 800 MHz | 4937 | 4682 | 4160 | 3013 | 1803 | 1682 | DDR3-1066
Intel Linux 32 | 32 bits | Dell XPS T700r | Pentium III | 700 MHz | 100 MHz | 2629 | 2284 | 2607 | 1630 | 448.5 | 163.7 | PC100
ARM Linux 32 | 32 bits | Sheevaplug | Marvell Kirkwood ARM | 1.2 GHz | | 3418 | 529.0 | 469.6 | 859.1 | 396.0 | 546.1 | DDR2
Windows Mobile | 32 bits | HTC Jade 100 | Marvell ARM | 624 MHz | | 2165 | 483.7 | | | 130.7 | 434.5 |
Intel Linux 32 | 32 bits | IBM Thinkpad 560E | Pentium MMX | 150 MHz | Up to 66 MHz | 500.7 | 75.49 | 520.6 | 74.81 | 86.64 | 74.32 | EDO 60ns; 50 MHz
* = Rate for writing while bypassing caches.

Note: Since I added graphing to bandwidth, I am no longer updating this table.

Running multiple instances of bandwidth simultaneously

Is bandwidth actually showing the maximum bandwidth to and from main memory? There is an easy way to test this. We can run one instance of bandwidth on each core of a multi-core CPU (in my case, two instances, one for each core) and add up the access bandwidths to/from main memory for all instances to see whether they approach the published limits for our main memory system.

On my Core i5 dual-core system, with DDR3 (PC3-8500) memory, the maximum RAM bandwidth ought to be 8500 MB/second.
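
The 8500 MB/second figure follows from the module's transfer rate and bus width. Here is a minimal sketch of that arithmetic, assuming a 64-bit (8-byte) data path per channel and reading 1066 as millions of transfers per second. The function name is hypothetical.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bus_width_bits=64, channels=1):
    """Theoretical peak in MB/second: transfers/s x bytes per transfer x channels."""
    return mega_transfers_per_s * (bus_width_bits // 8) * channels

# PC3-8500 (DDR3-1066), one channel: roughly 8500 MB/second.
print(peak_bandwidth_mb_s(1066))
# The same modules with two channels working together.
print(peak_bandwidth_mb_s(1066, channels=2))
```

Real measurements will fall short of this number, since it ignores refresh cycles, bank conflicts and other overhead.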

Running on just one core:

  • Reading, it maxes out at ~7050 MB/second from main memory.
  • Writing through the caches, it maxes out at ~5120 MB/second to main memory.
  • Writing and bypassing the caches, it maxes out at ~5520 MB/second to main memory.

Read:

Write, bypassing caches:

When I've got two instances of bandwidth running at the same time, one on each core, the picture is a little different but not much.

  • Reading, the total bandwidth from main memory is ~8000 MB/second, nearing the memory's maximum, or 14% faster than running just one instance of bandwidth.
  • Writing without bypassing the caches, the total bandwidth to main memory is ~5650 MB/second, which is 10% faster than one instance.
  • Writing with the cache bypass, the total bandwidth to main memory is ~6050 MB/second, which is 10% faster than one instance.

Read:

Write, not bypassing caches:

Write, bypassing caches:

Thus, to really ascertain the upper performance limit of the main memory, it behooves the earnest benchmarker to run multiple instances of bandwidth and sum the results.
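
The comparison above can be reproduced with a short Python sketch. It uses the approximate figures quoted in this section, so the percentages it prints may differ by a point from the rounded ones in the text. The 8500 MB/second limit is the nominal PC3-8500 rate, and the variable and function names are made up for the example.

```python
# Approximate main-memory bandwidth in MB/second, from the lists above.
one_instance = {"read": 7050, "write": 5120, "write, bypassing caches": 5520}
two_instances_total = {"read": 8000, "write": 5650, "write, bypassing caches": 6050}

NOMINAL_LIMIT = 8500  # PC3-8500, MB/second (nominal)

def gain_percent(single, combined):
    """Percentage by which the combined total exceeds a single instance."""
    return (combined - single) / single * 100

for test, single in one_instance.items():
    total = two_instances_total[test]
    print(
        f"{test}: {single} -> {total} MB/second "
        f"(+{gain_percent(single, total):.1f}%, "
        f"{total / NOMINAL_LIMIT:.0%} of nominal limit)"
    )
```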

Graphs for memory bandwidth tests


Intel Xeon E5-2690, rated at 2.9 GHz with Turbo Boost speed 3.8 GHz, with two sticks of DDR3 1600 MHz RAM (rated 12.8 GB/s/channel but not using dual-channel mode), running 64-bit Linux.

Intel Core i5-2520M, rated at 2.5 GHz but running at the Turbo Boost speed 3.2 GHz, with two sticks of DDR3 1333 MHz RAM (rated 10.6 GB/s/channel), running 64-bit Ubuntu 10.

Intel Core i5-520M at 2.4 GHz with 3MB L3 cache, running Mac OS/X Snow Leopard, 64-bit routines:

Intel Celeron at 2.8 GHz with 128kB L2 cache and PC2700 DDR memory running 32-bit Linux 2.6:

Note the difference
Not using the nice command:
Using nice -n -2:

Intel Core 2 Duo P8600 at 2.4 GHz with 3MB L2 cache, running Mac OS/X Snow Leopard, 64-bit routines:

Intel Core i5-540M at 2.53 to 3.07 GHz with 3MB L3 cache, running 64-bit Linux:

Intel Core i3-330M at 2.13 GHz with 3MB L3 cache, running 64-bit Linux:

Intel Pentium T4300 at 2.1 GHz with 1MB L2 cache, running 64-bit Linux:

Intel Core 2 Duo T5600 at 1.83 GHz, with 2MB L2 cache, in a Mac Mini:

Intel Core 2 Duo T5200 at 1.6 GHz, with 2MB L2 cache, running Linux 64-bit:

Intel Core 2 Duo E8400 at 3 GHz, with 6 MB L2 cache:

OMAP 3530 Cortex A8 in a Beagle Board running 32-bit Linux 2.6.29 at 720 MHz (turbo mode):

Graphs for network bandwidth tests

Here is a test of the loopback device on a 2.4 GHz Core i5 based Mac running the 32-bit OS/X 10.6.5 kernel and 64-bit bandwidth:

Here is another loopback test but for a 2.8 GHz Celeron running 32-bit Linux:

And here is the Mac communicating with the Linux box over Wifi:

Download

Be nice

On Linux, I recommend running bandwidth with nice -n -2, since the kernel may otherwise attempt to throttle the process. Setting a negative niceness normally requires root privileges.

Compiling

On Mac OS/X, you will certainly need to upgrade to the latest NASM. Then compile with "make bandwidth-mac64" or "make bandwidth-mac32".

On Intel Linux, you need a copy of NASM and the GCC suite. A decent distro will supply these. Simply type "make bandwidth32" or "make bandwidth64" to produce the Intel executables.

Lastly, to compile for 32-bit desktop Windows you need the GCC toolchain in the form of Cygwin. Type "make bandwidth32". The executable will not run outside of Cygwin because it requires a Cygwin DLL.

Commentary

AVX qualification

AVX's registers and instructions actually require special handling that bandwidth currently does not provide. I'll be adding that in a future release. Despite this, bandwidth's AVX performance in some cases is very good. For more information on the issues related to using AVX, see this Intel document.

Sequential versus random memory access

Modern processor technology is optimized for predictable memory access behaviors, and sequential accesses are of course that. As the graphs above show, out-of-order accesses disrupt the cache contents, resulting in lower bandwidth. Such a result is more like real-world performance, albeit only for memory-intensive programs.

Generalizations about memory and register performance

One has certain expectations about the performance of different memory subsystems in a computer. My program confirms these.
  1. Reading is usually faster than writing.
  2. L1 cache accesses are significantly faster than L2 accesses e.g. by a factor of 2.
  3. L1 cache accesses are much faster than main memory accesses e.g. by a factor of 5 or more.
  4. L2 cache accesses are faster than main memory accesses e.g. by a factor of 3 or more.
  5. L2 cache writing is usually significantly slower than L2 reading. This is because existing data in the cache has to be flushed out to main memory before it can be replaced.
  6. If the L2 cache is in write-through mode then L2 writing will be very slow and more on par with main memory write speeds.
  7. Main memory is slower to write than to read. This is just the nature of DRAM. It takes time to charge or discharge the capacitor that is in each DRAM memory cell whereas reading it is much faster.
  8. Framebuffer accesses are usually much slower than main memory.
  9. However framebuffer writing is usually faster than framebuffer reading.
  10. C library memcpy and memset are often pretty slow; perhaps this is due to unaligned loads and stores and/or insufficient optimization.
  11. Register-to-register transfers are the fastest possible transfers.
  12. Register-to/from-stack transfers are often half as fast as register-to-register transfers.

A historical addendum

One factor that reduces a computer's bandwidth is a write-through cache, be it L2 or L1. These were used in early Pentium-based computers but were quickly replaced with more efficient write-back caches.

SSE4 vector-to/from-register transfers

While transfers between the main registers and the XMM vector registers using the MOVD and MOVQ instructions perform well, transfers involving the PINSR* and PEXTR* instructions are slower than expected. In general, to move a 64-bit value into or out of an XMM register using MOVQ is twice as fast as using PINSRQ or PEXTRQ, suggesting a lack of optimization on Intel's part of the latter instructions.

What about ganged mode?

Let's say your motherboard supports dual-channel RAM operation. This means that your two DIMMs are managed together, providing the CPU with what appears to be a single 128-bit wide memory device.

Whether you are using dual-channel mode depends not only on your motherboard and chipset, but also on whether your BIOS is configured for it.

The default BIOS setting for this, referred to as the DCT or DCTs feature, is often unganged i.e. the two memory sticks are not acting together.

What is a DCT? This refers to a DRAM Controller. The fact that in the unganged mode each channel is independent means that there is need for a DCT for each channel. A motherboard and chipset supporting a wide path to your RAM will likely provide as many DCTs as there are channels, as needed for unganged mode.

In the BIOS settings you will either see a simple selection for ganged versus unganged mode, or it may refer to which actual DRAM controllers are assigned to which channels e.g. DCT0 (first DRAM controller) goes to channel A and DCT1 (second controller) manages channel B.

If your computer doesn't have an old-style PC BIOS but rather uses UEFI like Apple devices do, you may not have the option to alter the ganged/unganged setting. So consumers are disempowered firstly in that settings may not be accessible, but secondly in that the details of how UEFI works are actually proprietary and subject to an NDA. Therefore: UEFI is bad for consumers.

Q: Does ganged mode actually improve speed?
A: People say it generally neither improves nor reduces speed, which is why it is not enabled by default. Ganged and unganged offer about the same performance for any given application running on one core.

Q: If unganged mode requires more silicon (one DCT per channel) but has the same performance as ganged mode, then why not enable ganged by default and remove the extra DRAM controllers?
A: Because unganged mode offers more flexibility in letting multiple cores and hyperthreads access different areas of memory at once.

Q: How can maximum performance be achieved realistically?
A: With the help of the OS, a program could allocate its in-memory data sets across the DIMMs (obviously this is a virtual memory issue) to avoid the bottleneck of all of its data going through just one channel.

Why is L1 cache read speed so amazingly fast on XYZ CPU?

One user showed me a bandwidth graph from the i5-2520M where for some reason, loading 128-bit values from the L1 cache sequentially into the XMM registers was running at an astounding 96 GB per second. Writing into L1 was much slower. Here is that graph:

After a few calculations, it became clear why this was happening:

1. 96 GB per second at the processor's peak speed of 3.2 GHz means a transfer rate of roughly 32 bytes, i.e. 256 bits, per cycle. But the XMM registers are only 128 bits wide...

2. Newer Intel CPUs have YMM registers however, used by Advanced Vector Extensions (AVX) instructions. These are 256 bits wide and are composed of the existing 16 XMM registers with an additional 128 bits per register for a combined 256 bits per register.

3. However, note that my test is not loading YMM registers; it is loading XMM registers. What the processor is doing is transferring 128 bits per cycle to each of two separate registers, which in a straightforward hardware design would share the same input wires.

Therefore, Intel has designed the circuitry so that two XMM registers can be loaded in one cycle. It seems likely that either the L1-to-XMM/YMM path has been expanded to 256 bits or reading from L1 is possible at a rate of two per cycle (dual data rate). The latter seems more likely.
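
The per-cycle arithmetic behind this reasoning can be checked with a short Python sketch. Whether "GB" here means 10^9 or 2^30 bytes is an assumption either way, so both are shown. The function name is hypothetical.

```python
def bytes_per_cycle(bandwidth_gb_s, clock_ghz, bytes_per_gb=10**9):
    """Bytes moved per clock cycle at a given bandwidth and clock speed."""
    return bandwidth_gb_s * bytes_per_gb / (clock_ghz * 10**9)

XMM_BYTES = 16  # one 128-bit XMM register

for label, unit in (("decimal GB", 10**9), ("binary GB", 2**30)):
    per_cycle = bytes_per_cycle(96, 3.2, unit)
    print(
        f"{label}: {per_cycle:.1f} bytes/cycle, "
        f"about {per_cycle / XMM_BYTES:.2f} XMM loads per cycle"
    )
```

Under either reading the result is close to two 128-bit loads per cycle, which is the point of the argument above.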

My Xeon has a 20 MB shared L3. Will it be fast?

A shared L3 means that if you have X cores and each is running a program that is very memory-intensive, like bandwidth or for instance the genetics program BLAST, the amount of L3 that each core can effectively use will be its fraction of the total. If Y is the number of megabytes of L3, this would be Y/X.

This was proven to me recently when a person with a dual-CPU Xeon 2690 system (20 MB L3, 8 cores and 4 channels per CPU) ran bandwidth on 8 cores out of 16, resulting in each core effectively having only 5 MB of L3. If he had been running bandwidth on all cores, obviously each would effectively have only 2.5 MB of L3 to use.

If one were to use an AMD Opteron with 10 cores and 30 MB of shared L3, the worst case situation would be each core having effectively 3 MB of L3, which is the same as a lowly consumer-grade Core i5.
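
The Y/X estimate is easy to tabulate. Here is a minimal Python sketch, assuming the busy cores are spread evenly across the CPUs and share the L3 equally, which is a simplification. The function name is made up for the example.

```python
def l3_per_core_mb(l3_mb_per_cpu, cpus, busy_cores):
    """Effective L3 per busy core, assuming an even split of the shared cache."""
    return l3_mb_per_cpu * cpus / busy_cores

# Dual Xeon, 20 MB L3 per CPU, bandwidth running on 8 of the 16 cores.
print(l3_per_core_mb(20, cpus=2, busy_cores=8))
# The same system with all 16 cores busy.
print(l3_per_core_mb(20, cpus=2, busy_cores=16))
# Single 10-core CPU with 30 MB of shared L3.
print(l3_per_core_mb(30, cpus=1, busy_cores=10))
```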

Thus with the Xeon and Opteron used with a memory-intensive application running on each core, perhaps one's priority should be on:

  • Using the fastest possible RAM.
  • Choosing a CPU with a large L2 cache (Opteron is 512 kB but Xeon is 256 kB).
  • Organizing the data set as best as possible in shared memory.

Why is the 128-bit performance of the Turion so horribly bad?

It has been reported to me that AMD's Turion performs very poorly when running SSE instructions. I haven't had a chance to verify this myself to make sure it is real, but do note that Turion is the lowest-grade consumer product so it's no surprise that it's hobbled. Don't buy a Turion.