© 2012-2020 by Zack Smith. All rights reserved.
My program, called simply bandwidth, is an artificial benchmark primarily for measuring memory bandwidth on x86 and x86_64 based computers, useful for identifying weaknesses in a computer's memory subsystem, in the bus architecture, in the cache architecture and in the processor itself.
bandwidth also tests some libc functions and, under GNU/Linux, it attempts to test framebuffer memory access speed if the framebuffer device is available.
This program is open source and covered by the GPL license. Although I wrote it mainly for my own benefit, I am also providing it pro bono, i.e. for the public good.
Release 1.9: More object-oriented improvements.
Release 1.8: More object-oriented improvements. Windows 64-bit support.
Release 1.7: Isolated Object-Oriented C library.
Release 1.6: Updated to use object-oriented C. Fixed Raspberry pi support.
Release 1.5: Improved 256-bit routines. Added --nice switch.
Release 1.4: I added randomized 256-bit routines for 64-bit Intel CPUs.
Release 1.3: I added CSV output. I updated the ARM code for the Raspberry pi 3 (AArch32).
Release 1.2: I put my old 32-bit ARM code back in for generic ARM systems.
Release 1.1: This release adds a second, larger font.
Release 1.0: This update separates out the graphing functionality. It also adds tests for the LODS[BWDQ] instructions, because while it is common knowledge that these instructions are slow and useless, sometimes widely-held beliefs are wrong, so I added this test which proves just how dramatically slow LODS instructions are.
Release 0.32: A little support for AVX.
Release 0.31: This release adds printing of cache information for Intel processors in 32-bit mode.
Release 0.30: This release adds printing of cache information for Intel processors in 64-bit mode.
Release 0.29: Further improved granularity with addition of 128-byte tests. Removed ARM support.
Release 0.28: Added proper feature checks using the CPUID instruction.
Release 0.27: Added 128-byte chunk size tests to x86 processors to improve granularity, especially around the 512-byte dip seen on Intel CPUs.
Release 0.26: AMD processors don't support SSE4 vector instructions, so I updated bandwidth to not utilize those when running on AMD-based computers.
Release 0.25: This update, released December 17th, 2010, extended the network bandwidth testing.
Release 0.24: Added a network bandwidth test.
The latest bandwidth adds support for 64-bit Windows, so it now supports:
- 32- and 64-bit GNU/Linux
- 32- and 64-bit Windows
- 64-bit Mac OS/X
- Raspberry pi ARM running 32-bit Raspbian
- Generic ARM
And it already supported three processor architectures:
- x86: performs 128- and 32-bit transfers
- x86_64: performs 128- and 64-bit transfers
- ARM 32-bit.
Why write the core routines in assembly language?
For each architecture, I've implemented optimized core routines in assembly language.
Because the exact same core assembly language routines run on all computers of a given architecture, it is similar to the same ruler being used to measure multiple items.
This approach is crucial. If the core routines had been written in C or C++, the code actually executed would differ depending on the version of the compiler and the compilation options, and the measurements could not be used for valid comparisons.
Results from Macbook Pro with 2.4 GHz Core i5 520M and 1066 MHz RAM
How fast is each type of storage on a typical system? This is the kind of thing students of Computer Architecture are asked on final exams.
For my Macbook Pro, the numbers are as follows.
- Reading from the Crucial m4 SSD: 250 MB/second.
- Reading from main memory (1066 MHz Crucial): maximum 7 GB/second = 28 times faster.
- Reading from L3 cache: maximum 21 GB/second = 3 times faster than main memory or 86 times faster than SSD.
- Reading from L2 cache: maximum 29.5 GB/second = 1.4 times faster than L3; 4.2 times faster than main memory; or 120 times faster than SSD.
- Reading from L1 cache: maximum 44.5 GB/second = 1.5 times faster than L2; 2.1 times faster than L3; 6.4 times faster than main memory; or 178 times faster than SSD.
And the SSD is up to 4 times faster than the original hard disk drive.
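The ratios above can be rechecked with a few lines of Python. This is only a sketch using the rounded maxima quoted in the list; the dictionary and function names are mine and are not part of bandwidth. Because the inputs are rounded, a couple of the computed ratios come out slightly lower than the listed ones (for example about 84 rather than 86 for L3 versus the SSD).

```python
# Rounded peak read rates quoted above, in MB/second (1 GB taken as 1000 MB).
# Ordered from slowest to fastest.
read_mb_per_s = {
    "SSD": 250,
    "main memory": 7000,
    "L3": 21000,
    "L2": 29500,
    "L1": 44500,
}

def speedup(faster, slower):
    """Return how many times faster one storage level reads than another."""
    return read_mb_per_s[faster] / read_mb_per_s[slower]

levels = list(read_mb_per_s)
for i, level in enumerate(levels[1:], start=1):
    # Compare each level with every slower level, nearest first.
    ratios = ", ".join(
        f"{speedup(level, slower):.1f}x {slower}" for slower in reversed(levels[:i])
    )
    print(f"{level}: {ratios}")
```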
Observations of running one instance of bandwidth
The table below presents program output from recent and former versions of bandwidth, juxtaposed. They all use the same core routines. These numbers cover only sequential accesses.
The first interesting thing to notice is the difference in performance between 32-, 64-, and 128-bit transfers on the same processor. These differences show that if programmers went to the trouble of revising software to use 64- or 128-bit transfers where appropriate, and especially to make those transfers sequential and aligned to appropriate byte boundaries, great speed-ups could be achieved.
A second observation is the importance of having fast DRAM. Recent DRAM, overclocked, can give stupendous results: the overclocked Core i7-930 system with DDR3-2000 reads main memory at 18400 MB/sec, more than twice the rate of any other system in the table.
A third observation is the remarkable difference in speeds between memory types. In some cases the L1 cache is more than twice as fast as L2, and L1 is up to 9 times faster than main memory, whereas L2 is often 3 times faster than DRAM.
|OS||Transfer size||PC Make/model||CPU||CPU speed||Front-side bus speed||L1 read MB/sec||L1 write MB/sec||L2 read MB/sec||L2 write MB/sec||Main read MB/sec||Main write MB/sec||Main memory RAM type/speed|
|Intel GNU/Linux 64||128 bits||Intel Core i7-930||Overclock||4.27 GHz||2000 MHz||64900||65100||43200||39900||18400||12800||DDR3-2000|
|Mac OS/X Snow Leopard||128 bits||Macbook Pro 15 2010||Intel Core i5-520M||2.4 GHz||1066 MHz||44500||44500||29600||27300||7100||5200||PC3-8500|
|Intel GNU/Linux 64||128 bits||Lenovo Thinkpad T510||Intel Core i5-540M||2.53 GHz||1066 MHz||42000||42000||28500||26500||8000||3500||PC3-8500|
|Mac OS/X Snow Leopard||128 bits||Macbook Pro MC374LL/A||Intel Core 2 Duo P8600||2.4 GHz||1066 MHz||36500||34500||17000||14300||5620||5380*||PC3-8500|
|Intel GNU/Linux 64||128 bits||Thinkpad Edge 15||Intel Core i3-330M||2.13 GHz||1066 MHz||32110||32070||21380||19730||6390||2790||DDR3-1066|
|Intel GNU/Linux 64||128 bits||Toshiba L505||Intel T4300||2.1 GHz||800 MHz||31930||30190||15000||12500||4828||4036*||DDR2-800|
|Intel GNU/Linux 64||128 bits||Toshiba A135||Intel Core 2 Duo T5200||1.6 GHz||533 MHz||24250||18970||9619||7237||2995.||2299.||PC2-4200|
|Intel GNU/Linux 32||32 bits||Lenovo 3000 N200||Celeron 550||2.0 GHz||533 MHz||7489||7125||6533||5007||2088||1290.||PC2-5300|
|Intel GNU/Linux 32||32 bits||Toshiba A205||Pentium Dual T2390||1.86 GHz||533 MHz||7098||6734||7095||5675||2146||1255||PC2-5300|
|Intel GNU/Linux 32||32 bits||Acer 5810TZ-4761||Intel SU4100||1.3 GHz||800 MHz||4937||4682||4160.||3013||1803||1682||DDR3-1066|
|Intel GNU/Linux 32||32 bits||Dell XPS T700r||Pentium III||700 MHz||100 MHz||2629||2284||2607||1630.||448.5||163.7||PC100|
|ARM GNU/Linux 32||32 bits||Sheevaplug||Marvell Kirkwood ARM||1.2 GHz||3418.||529.0||469.6||859.1||396.0||546.1||DDR2|
|Windows Mobile||32 bits||HTC Jade 100||Marvell ARM||624 MHz||2165.||483.7||130.7||434.5|
|Intel GNU/Linux 32||32 bits||IBM Thinkpad 560E||Pentium MMX||150 MHz||Up to 66 MHz||500.7||75.49||520.6||74.81||86.64||74.32||EDO 60ns; 50 MHz|
* = Rate for writing while bypassing caches.
Note: Since I added graphing to bandwidth, I am no longer updating this table.
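The third observation can be recomputed directly from the table. The sketch below copies the read columns of three rows (the choice of rows and all names are mine) and prints the L1/L2, L1/main and L2/main ratios; it is a quick check of the arithmetic, not part of bandwidth itself.

```python
# (L1 read, L2 read, main read) in MB/sec, copied from three rows of the table.
rows = {
    "Core i7-930 (overclocked)": (64900, 43200, 18400),
    "Core 2 Duo P8600": (36500, 17000, 5620),
    "Core 2 Duo T5200": (24250, 9619, 2995),
}

def read_ratios(l1, l2, main):
    """Return (L1/L2, L1/main, L2/main) read-speed ratios."""
    return l1 / l2, l1 / main, l2 / main

for name, rates in rows.items():
    l1_l2, l1_main, l2_main = read_ratios(*rates)
    print(f"{name}: L1/L2 {l1_l2:.1f}, L1/main {l1_main:.1f}, L2/main {l2_main:.1f}")
```

For these rows the L1/L2 ratio ranges from about 1.5 to 2.5 and the L1/main ratio from about 3.5 to 8.1.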
Running multiple instances of bandwidth simultaneously
Is bandwidth actually showing the maximum bandwidth to and from main memory? There is an easy way to test this. We can run one instance of bandwidth on each core of a multi-core CPU (in my case, two instances, one for each core) and add up the access bandwidths to/from main memory for all instances to see whether they approach the published limits for our main memory system.
On my Core i5 dual-core system, with DDR3 (PC3-8500) memory, the maximum RAM bandwidth ought to be 8500 MB/second.
Running on just one core:
- Reading, it maxes out at 7050 MB/second from main memory.
- Writing through the caches, it maxes out at 5120 MB/second to main memory.
- Writing and bypassing the caches, it maxes out at 5520 MB/second to main memory.
When I've got two instances of bandwidth running at the same time, one on each core, the picture is a little different but not much.
- Reading, the total bandwidth from main memory is 8000 MB/second, nearing the memory's maximum, or 14% faster than running just one instance of bandwidth.
- Writing without bypassing the caches, the total bandwidth to main memory is 5650 MB/second, which is 10% faster than one instance.
- Writing with the cache bypass, the total bandwidth to main memory is 6050 MB/second, which is 10% faster than one instance.
Thus, to really ascertain the upper performance limit of the main memory, it behooves the earnest benchmarker to run multiple instances of bandwidth and sum the results.
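The summing step can be sketched as follows, using the rounded figures quoted above. The names are hypothetical, and the 8500 MB/second peak is the nominal PC3-8500 rate assumed in the text. From these rounded inputs the gains come out at roughly 10 to 14 percent, and the combined read rate reaches about 94 percent of the nominal peak.

```python
PEAK_MB_PER_S = 8500  # nominal PC3-8500 rate quoted above

# Measured maxima in MB/second: (one instance, two instances summed).
measurements = {
    "read": (7050, 8000),
    "write through caches": (5120, 5650),
    "write bypassing caches": (5520, 6050),
}

def summarize(single, combined, peak=PEAK_MB_PER_S):
    """Return (percent gain over one instance, percent of nominal peak reached)."""
    gain = (combined / single - 1) * 100
    of_peak = combined / peak * 100
    return gain, of_peak

for name, (single, combined) in measurements.items():
    gain, of_peak = summarize(single, combined)
    print(f"{name}: +{gain:.1f}% over one instance, {of_peak:.1f}% of nominal peak")
```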
Graphs for memory bandwidth tests
Xeon E5-2630 v4 running at 2.2GHz, no TurboBoost, with 25MB smart cache and 128GB RAM Quad-Channel DDR4-2400, running GNU/Linux.
Intel Xeon E5-2690, rated at 2.9 GHz with Turbo Boost speed 3.8 GHz, with two sticks of DDR3 1600 MHz RAM (rated 12.8 GB/s/channel but not using dual-channel mode), running 64-bit GNU/Linux.
Intel Core i5-2520M, rated at 2.5 GHz but running at the Turbo Boost speed 3.2 GHz, with two sticks of DDR3 1333 MHz RAM (rated 10.6 GB/s/channel), running 64-bit Ubuntu 10.
Intel Core i5-520M at 2.4 GHz with 3MB L3 cache, running Mac OS/X Snow Leopard, 64-bit routines:
Intel Core 2 Duo P8600 at 2.4 GHz with 3MB L2 cache, running Mac OS/X Snow Leopard, 64-bit routines:
Intel Core i5-540M at 2.53 to 3.07 GHz with 3MB L3 cache, running 64-bit GNU/Linux:
Intel Core i3-330M at 2.13 GHz with 3MB L3 cache, running 64-bit GNU/Linux:
Intel Pentium T4300 at 2.1 GHz with 1MB L2 cache, running 64-bit GNU/Linux:
Intel Core 2 Duo T5600 at 1.83 GHz, with 2MB L2 cache, in a Mac Mini:
Intel Core 2 Duo T5200 at 1.6 GHz, with 2MB L2 cache, running GNU/Linux 64-bit:
Intel Core 2 Duo E8400 at 3 GHz, with 6 MB L2 cache:
OMAP 3530 Cortex A8 in a Beagle Board running 32-bit GNU/Linux 2.6.29 at 720 MHz (turbo mode):
- What changed:
- Fixed bug affecting Linux.
Support for AVX512 was not completed and if your CPU offers that feature you might observe a crash. I don't have time to fix this, but if anyone would like to provide a patch I can make it available here.
On GNU/Linux, I recommend using nice -n -2 when running bandwidth. The kernel may attempt to throttle the process otherwise.
On Mac OS/X, you will certainly need to upgrade to the latest NASM. Then compile with make bandwidth64. Note that the latest MacOS does not allow 32-bit compilation.
On GNU/Linux, you need a copy of NASM and the GCC suite. A decent distro will supply these. Simply type make bandwidth32 or make bandwidth64 to produce the Intel executables.
- 64-bit: To compile on 64-bit Windows type make bandwidth64.
- 32-bit: Because Cygwin is no longer available for 32-bit Windows you may have to use MinGW. To compile for 32-bit Windows, type make bandwidth32.
Max Memory Bandwidth number
When Intel says you can achieve a Max Memory Bandwidth of e.g. 68 GB/sec from your 18-core processor, what they mean is the upper combined limit for all cores. To test this, you can run multiple copies of my bandwidth utility simultaneously, then add up the bandwidth values from each core accessing main memory. Each individual core may achieve quite a bit less bandwidth going to main memory. That's OK.
This larger number may at first seem like a marketing gimmick from Intel, but it's a good number to know because when your system is extremely busy, this is the upper limit that will constrain all the cores' combined activity. What Intel should also do is give the per-core maximum alongside the collective maximum.
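A published maximum of this kind normally corresponds to the theoretical peak of the memory interface: transfers per second, times 8 bytes per 64-bit DDR channel, times the number of channels. The helper below is a minimal sketch of that arithmetic (the function name and the example configurations are my assumptions, not Intel's). It reproduces the per-channel ratings mentioned elsewhere on this page, and four channels of DDR4-2133 would give roughly the 68 GB/sec used as the example above.

```python
def peak_bandwidth_mb_per_s(mega_transfers_per_s, channels=1, bus_width_bits=64):
    """Theoretical DDR peak: transfers/second x bytes per transfer x channels."""
    return mega_transfers_per_s * (bus_width_bits // 8) * channels

print(peak_bandwidth_mb_per_s(1066))      # DDR3-1066, one channel (sold as PC3-8500)
print(peak_bandwidth_mb_per_s(1600))      # DDR3-1600, one channel: 12.8 GB/s
print(peak_bandwidth_mb_per_s(2133, 4))   # DDR4-2133, four channels: about 68 GB/s
```

Measured per-core figures will sit below these numbers, which is why the results of several simultaneous instances have to be added together before comparing against the published maximum.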
The impact of an L4 cache
Level 4 caches are ostensibly for improving graphics performance, the idea being that the GPU shares the cache with the CPU. But does it affect CPU performance?
bandwidth user Michael V. provided a graph that shows that it does for the Intel Core i7 4750HQ. The 128MB L4 cache appears to be roughly twice as fast as main memory.
I have reinstated ARM support but mainly for the Raspberry pi 3. An earlier release of bandwidth supported 32-bit ARM CPUs found in Windows Mobile phones and iOS devices. There is a lot of variability in ARM CPUs in terms of what instructions are supported, so I don't plan to expand ARM support very much beyond the Rpi series.
Sequential versus random memory access
Modern processor technology is optimized for predictable memory access behaviors, and sequential accesses are of course that. As the graphs above show, out-of-order accesses disrupt the cache contents, resulting in lower bandwidth. Such a result is more like real-world performance, albeit only for memory-intensive programs.
Generalizations about memory and register performance
One has certain expectations about the performance of different memory subsystems in a computer. My program confirms these.
- Reading is usually faster than writing.
- L1 cache accesses are significantly faster than L2 accesses e.g. by a factor of 2.
- L1 cache accesses are much faster than main memory accesses e.g. by a factor of 5 or more.
- L2 cache accesses are faster than main memory accesses e.g. by a factor of 3 or more.
- L2 cache writing is usually significantly slower than L2 reading. This is because existing data in the cache has to be flushed out to main memory before it can be replaced.
- If the L2 cache is in write-through mode then L2 writing will be very slow and more on par with main memory write speeds.
- Main memory is slower to write than to read. This is just the nature of DRAM. It takes time to charge or discharge the capacitor that is in each DRAM memory cell whereas reading it is much faster.
- Framebuffer accesses are usually much slower than main memory.
- However framebuffer writing is usually faster than framebuffer reading.
- C library memcpy and memset are often pretty slow; perhaps this is due to unaligned loads and stores and/or insufficient optimization.
- Register-to-register transfers are the fastest possible transfers.
- Register-to/from-stack transfers are often half as fast as register-to-register transfers.
A historical addendum
One factor that reduces a computer's bandwidth is a write-through cache, be it L2 or L1. These were used in early Pentium-based computers but were quickly replaced with more efficient write-back caches.
SSE4 vector-to/from-register transfers
While transfers between the main registers and the XMM vector registers using the MOVD and MOVQ instructions perform well, transfers involving the PINSR* and PEXTR* instructions are slower than expected. In general, moving a 64-bit value into or out of an XMM register using MOVQ is twice as fast as using PINSRQ or PEXTRQ, suggesting that Intel has not optimized the latter instructions.
Let's say your motherboard supports dual-channel RAM operation. This means that your two DIMMs are managed together, providing the CPU with what appears to be a single 128-bit wide memory device.
Whether you are using dual-channel mode depends not only on your motherboard and chipset, but also on whether your BIOS is configured for it.
The default BIOS setting for this, referred to as the DCT or DCTs feature, is often unganged, i.e. the two memory sticks are not acting together.
What is a DCT? This refers to a DRAM Controller. The fact that in the unganged mode each channel is independent means that there is a need for a DCT for each channel. A motherboard and chipset supporting a wide path to your RAM will likely provide as many DCTs as there are channels, as needed for unganged mode.
In the BIOS settings you will either see a simple selection for ganged versus unganged mode, or it may refer to which actual DRAM controllers are assigned to which channels, e.g. DCT0 (first DRAM controller) goes to channel A and DCT1 (second controller) manages channel B.
If your computer doesn't have an old-style PC BIOS but rather uses UEFI like Apple devices do, you may not have the option to alter the ganged/unganged setting. So consumers are disempowered firstly in that settings may not be accessible, but secondly in that the details of how UEFI works are actually proprietary and subject to an NDA. Therefore: UEFI is bad for consumers.
Q: Does ganged mode actually improve speed?
A: It is generally reported neither to improve speed nor to reduce it, which is why it is not enabled by default. Ganged and unganged offer about the same performance for any given application running on one core.
Q: If unganged mode requires more silicon (one DCT per channel) but has the same performance as ganged mode, then why not enable ganged by default and remove the extra DRAM controllers?
A: Because unganged mode offers more flexibility in letting multiple cores and hyperthreads access different areas of memory at once.
Q: How can maximum performance be achieved realistically?
A: With the help of the OS, a program could allocate its in-memory data sets across the DIMMs (obviously this is a virtual memory issue) to avoid the bottleneck of all of its data going through just one channel.
Why is L1 cache read speed so amazingly fast on XYZ CPU?
One user showed me a bandwidth graph from the i5-2520M where for some reason, loading 128-bit values from the L1 cache sequentially into the XMM registers was running at an astounding 96 GB per second. Writing into L1 was much slower.
After a few calculations, it became clear why this was happening:
- 96 GB per second at the processor's peak speed of 3.2 GHz means a transfer rate of about 32 bytes, i.e. 256 bits, per cycle. But the XMM registers are only 128 bits wide...
- Newer Intel CPUs have YMM registers however, used by Advanced Vector Extensions (AVX) instructions. These are 256 bits wide and are composed of the existing 16 XMM registers with an additional 128 bits per register for a combined 256 bits per register.
- Notice, however, that my test is not loading YMM registers; it is loading XMM registers. What the microcode is doing is transferring 128 bits per cycle to each of two separate registers, which in a straightforward hardware design would share the same input wires.
Therefore, Intel has designed the circuitry so that two XMM registers can be loaded in one cycle. It seems likely that either the L1-to-XMM/YMM path has been expanded to 256 bits or reading from L1 is possible at a rate of two per cycle (dual data rate). The latter seems more likely.
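The calculation behind the first bullet can be written out as below. This is a sketch with names of my own choosing; it assumes the 96 GB figure uses binary gigabytes (2^30 bytes), which gives just over 32 bytes per cycle, whereas decimal gigabytes would give 30.

```python
def bytes_per_cycle(gb_per_s, clock_ghz, bytes_per_gb=2**30):
    """Average bytes transferred per clock cycle at a given rate and clock speed."""
    return gb_per_s * bytes_per_gb / (clock_ghz * 1e9)

per_cycle = bytes_per_cycle(96, 3.2)
print(f"{per_cycle:.1f} bytes/cycle, about {per_cycle * 8:.0f} bits/cycle")
# An XMM register holds 16 bytes, so this is about two XMM loads per cycle.
print(f"{per_cycle / 16:.1f} XMM-sized loads per cycle")
```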
My Xeon has a 20 MB shared L3. Will it be fast?
A shared L3 means that if you have X cores and each is running a program that is very memory-intensive, like bandwidth or for instance the genetics program BLAST, the amount of L3 that each core can effectively use will be its fraction of the total. If Y is the number of megabytes of L3, this would be Y/X.
This was proven to me recently when a person with a dual-CPU Xeon 2690 system (20 MB L3, 8 cores and 4 channels per CPU) ran bandwidth on 8 cores out of 16, resulting in each core effectively having only 5 MB of L3. If he had been running bandwidth on all cores, obviously each would effectively have only 2.5 MB of L3 to use.
If one were to use an AMD Opteron with 10 cores and 30 MB of shared L3, the worst case situation would be each core having effectively 3 MB of L3, which is the same as a lowly consumer-grade Core i5.
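The Y/X rule is simple enough to state as a function. The sketch below uses hypothetical names and reproduces the three cases mentioned above. For the dual-CPU system it assumes the busy cores are spread evenly over both sockets, since each CPU's L3 is shared only among its own cores.

```python
def l3_per_core_mb(total_l3_mb, busy_cores):
    """Worst-case L3 share per core when every busy core is memory-intensive (Y/X)."""
    if busy_cores < 1:
        raise ValueError("busy_cores must be at least 1")
    return total_l3_mb / busy_cores

print(l3_per_core_mb(2 * 20, 8))    # dual Xeon, 8 of 16 cores busy: 5.0
print(l3_per_core_mb(2 * 20, 16))   # dual Xeon, all 16 cores busy: 2.5
print(l3_per_core_mb(30, 10))       # 10-core Opteron with 30 MB L3: 3.0
```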
Thus with the Xeon and Opteron used with a memory-intensive application running on each core, perhaps one's priority should be on:
- Using the fastest possible RAM.
- Choosing a CPU with a large L2 cache (Opteron is 512 kB but Xeon is 256 kB).
- Organizing the data set as best as possible in shared memory.
Why is the 128-bit performance of the Turion so horribly bad?
It has been reported to me that AMD's Turion performs very poorly when running SSE instructions. I haven't had a chance to verify this myself to make sure it is real, but do note that Turion is the lowest-grade consumer product so it's no surprise that it's hobbled. Don't buy a Turion.