Bandwidth: a memory bandwidth benchmark for x86 / x86_64 based Linux/Windows/MacOSX
Contact: veritasNOSPAM @ comcastNOSPAM . net
Introduction
My program, called simply bandwidth, is an artificial benchmark primarily for measuring memory bandwidth on x86 and x86_64 based computers, useful for identifying weaknesses in a computer's memory subsystem, in the bus architecture, in the cache architecture and in the processor itself.
Despite the focus on memory testing, in release 0.24 I also added network bandwidth testing. The results are graphed.
This program is open source and covered by the GPL license. Although I wrote it mainly for my own benefit, I am also providing it pro bono, i.e. for the public good.
Change log
Release 0.32
Release 0.31
This release adds printing of cache information for Intel processors in 32-bit mode.
Release 0.30
This release adds printing of cache information for Intel processors in 64-bit mode.
Release 0.29
Further improved granularity with addition of 128-byte tests. Removed ARM support.
Release 0.28
Added proper feature checks using the CPUID instruction.
Release 0.27
Added 128-byte chunk size tests to x86 processors to improve granularity, especially around the 512-byte dip seen on Intel CPUs.
Release 0.26
AMD processors don't support SSE4 vector instructions, so I updated bandwidth to not utilize those when running on AMD-based computers.
Release 0.25
This update, released December 17th, 2010, extended the network bandwidth testing as follows:
Release 0.24
The new network bandwidth test is at present simple: It sends chunks of data of varying sizes to nodes, which respond when the entirety of the chunk has been read. The time from start of send to receipt of response is used to calculate the bandwidth.
The node that runs the test is the leader and the others are the transponders. The network test cannot be combined with the memory bandwidth tests. Just to clarify, you need two computers to run the test:
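The leader/transponder exchange described for release 0.24 could be sketched roughly as follows. This is a Python illustration only, not the code bandwidth itself uses: the one-byte acknowledgement, the function names, and the assumption that both sides know the chunk sizes in advance are all hypothetical. For convenience it runs both roles on one machine over the loopback interface.

```python
import socket
import threading
import time

ACK = b"\x01"  # hypothetical one-byte response; the real wire format is not documented here


def recv_exact(conn, nbytes):
    """Read exactly nbytes from the socket, or raise if the peer closes early."""
    remaining = nbytes
    while remaining:
        data = conn.recv(min(remaining, 65536))
        if not data:
            raise ConnectionError("peer closed before the chunk was complete")
        remaining -= len(data)


def transponder(server_sock, chunk_sizes):
    """Accept one leader, read each chunk in full, then respond."""
    conn, _ = server_sock.accept()
    with conn:
        for size in chunk_sizes:
            recv_exact(conn, size)
            conn.sendall(ACK)


def leader(address, chunk_sizes):
    """Send each chunk and time from start of send to receipt of the response."""
    results = []
    with socket.create_connection(address) as conn:
        for size in chunk_sizes:
            payload = bytes(size)
            start = time.perf_counter()
            conn.sendall(payload)
            recv_exact(conn, len(ACK))
            elapsed = time.perf_counter() - start
            results.append((size, size / elapsed / 1e6))  # MB per second
    return results


def run_loopback(chunk_sizes):
    """Run a transponder thread and a leader against each other on 127.0.0.1."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    worker = threading.Thread(target=transponder, args=(server, chunk_sizes))
    worker.start()
    try:
        return leader(server.getsockname(), chunk_sizes)
    finally:
        worker.join()
        server.close()


if __name__ == "__main__":
    for size, mb_per_s in run_loopback([32768, 65536, 262144, 1048576]):
        print(f"{size:>8} bytes: {mb_per_s:10.1f} MB/s")
```

Because the timer stops only when the response arrives, small chunks are dominated by round-trip latency rather than throughput, which is one reason to test a range of sizes.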
Release 0.23
This latest bandwidth adds support for Mac OS/X, bringing the number of supported operating systems to four:
For each architecture, I've implemented optimized core routines in assembly. Bandwidth 0.23 builds on the novelty of 0.22's register-to-register transfer speeds by including transfers to, from, and between vector registers (XMM). It also adds a test of memory copy speeds.
Since revision 0.21, bandwidth performs both sequential and random reading and writing of a range of progressively larger chunks of memory, permitting you to effectively test several types of memory:
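To illustrate the idea of sweeping over progressively larger chunks, here is a minimal Python sketch of a memory copy test. It is an assumption-laden stand-in, not bandwidth's assembly routines: the function names, the doubling schedule, and the 64 MB per-size workload are all hypothetical, and interpreter and allocation overhead will depress the figures for small chunks.

```python
import time


def copy_bandwidth(chunk_bytes, total_bytes=64 * 1024 * 1024):
    """Return MB/s for repeatedly copying a chunk of the given size."""
    src = bytearray(chunk_bytes)
    repeats = max(1, total_bytes // chunk_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        dst = src[:]  # slicing a bytearray makes a full copy of the chunk
    elapsed = time.perf_counter() - start
    return chunk_bytes * repeats / elapsed / 1e6


def sweep(smallest=4 * 1024, largest=16 * 1024 * 1024):
    """Measure progressively larger chunks, doubling the size each step."""
    results = []
    size = smallest
    while size <= largest:
        results.append((size, copy_bandwidth(size)))
        size *= 2
    return results


if __name__ == "__main__":
    for size, mb_per_s in sweep():
        print(f"{size:>9} bytes: {mb_per_s:10.1f} MB/s")
```

On most machines the reported rate should fall off as the chunk size outgrows each cache level, which is the effect the real tests are designed to expose.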
Revision 0.20 added a novel and helpful improvement: Graphing.
Using my
Revision 0.19 added another novel improvement: When it performs 128-bit writes using SSE2 instructions it does it in two ways:
Results from iOS
My port of bandwidth to the ARM is included in my app iBenchmark. Here is a sample screenshot from the iPhone 3GS:
Observations of running one instance of bandwidth
The table below presents program output from recent and former versions of bandwidth.
| OS | Transfer size | PC Make/model | CPU | CPU speed | Front-side bus speed | L1 read MB/sec | L1 write MB/sec | L2 read MB/sec | L2 write MB/sec | Main read MB/sec | Main write MB/sec | Main memory RAM type/speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intel Linux 64 | 128 bits | | Intel Core i7-930 | Overclock 4.27 GHz | 2000 MHz | 64900 | 65100 | 43200 | 39900 | 18400 | 12800 | DDR3-2000 |
| Mac OS/X Snow Leopard | 128 bits | Macbook Pro 15 2010 | Intel Core i5-520M | 2.4 GHz | 1066 MHz | 44500 | 44500 | 29600 | 27300 | 7100 | 5200 | PC3-8500 |
| Intel Linux 64 | 128 bits | Lenovo Thinkpad T510 | Intel Core i5-540M | 2.53 GHz | 1066 MHz | 42000 | 42000 | 28500 | 26500 | 8000 | 3500 | PC3-8500 |
| Mac OS/X Snow Leopard | 128 bits | Macbook Pro MC374LL/A | Intel Core 2 Duo P8600 | 2.4 GHz | 1066 MHz | 36500 | 34500 | 17000 | 14300 | 5620 | 5380* | PC3-8500 |
| Intel Linux 64 | 128 bits | Thinkpad Edge 15 | Intel Core i3-330M | 2.13 GHz | 1066 MHz | 32110 | 32070 | 21380 | 19730 | 6390 | 2790 | DDR3-1066 |
| Intel Linux 64 | 128 bits | Toshiba L505 | Intel T4300 | 2.1 GHz | 800 MHz | 31930 | 30190 | 15000 | 12500 | 4828 | 4036* | DDR2-800 |
| Intel Linux 64 | 128 bits | Toshiba A135 | Intel Core 2 Duo T5200 | 1.6 GHz | 533 MHz | 24250 | 18970 | 9619 | 7237 | 2995. | 2299. | PC2-4200 |
| Intel Linux 32 | 32 bits | Lenovo 3000 N200 | Celeron 550 | 2.0 GHz | 533 MHz | 7489 | 7125 | 6533 | 5007 | 2088 | 1290. | PC2-5300 |
| Intel Linux 32 | 32 bits | Toshiba A205 | Pentium Dual T2390 | 1.86 GHz | 533 MHz | 7098 | 6734 | 7095 | 5675 | 2146 | 1255 | PC2-5300 |
| Intel Linux 32 | 32 bits | Acer 5810TZ-4761 | Intel SU4100 | 1.3 GHz | 800 MHz | 4937 | 4682 | 4160. | 3013 | 1803 | 1682 | DDR3-1066 |
| Intel Linux 32 | 32 bits | Dell XPS T700r | Pentium III | 700 MHz | 100 MHz | 2629 | 2284 | 2607 | 1630. | 448.5 | 163.7 | PC100 |
| ARM Linux 32 | 32 bits | Sheevaplug | Marvell Kirkwood ARM | 1.2 GHz | | 3418. | 529.0 | 469.6 | 859.1 | 396.0 | 546.1 | DDR2 |
| Windows Mobile | 32 bits | HTC Jade 100 | Marvell ARM | 624 MHz | 2165. | 483.7 | 130.7 | 434.5 | ||||
| Intel Linux 32 | 32 bits | IBM Thinkpad 560E | Pentium MMX | 150 MHz | Up to 66 MHz | 500.7 | 75.49 | 520.6 | 74.81 | 86.64 | 74.32 | EDO 60ns; 50 MHz |
Note: Since I added graphing to
bandwidth, I am no longer updating this table.
Observations of running multiple instances of bandwidth simultaneously
Is bandwidth actually showing the maximum bandwidth to and from main memory?
There is an easy way to test this.
We can run one instance of bandwidth on each core of a
multi-core CPU
(in my case, two instances, one for each core)
and add up the access bandwidths to/from main memory for all instances
to see whether they approach the published limits for our main memory system.
On my Core i5 dual-core system, with DDR3 (PC3-8500) memory, the maximum RAM bandwidth ought to be 8500 MB/second.
Running on just one core:
When I've got two instances of
bandwidth running at the same time, one on each core,
the picture is a little different but not much.
bandwidth.
Thus, to really ascertain the upper performance limit of
the main memory, it behooves the earnest benchmarker
to run multiple
instances of bandwidth and sum the results.
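The comparison against the published limit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the peak is computed as transfers per second times an 8-byte (64-bit) bus per channel, and the per-instance readings in the demo are placeholders rather than measurements from any machine.

```python
def peak_mb_per_s(megatransfers_per_s, bus_bytes=8, channels=1):
    """Theoretical peak: transfers per second times bytes per transfer."""
    return megatransfers_per_s * bus_bytes * channels


def combined_fraction(per_instance_mb_per_s, peak):
    """Sum the per-instance results and express them as a fraction of peak."""
    total = sum(per_instance_mb_per_s)
    return total, total / peak


if __name__ == "__main__":
    peak = peak_mb_per_s(1066.67)  # DDR3-1066, sold as PC3-8500
    # Placeholder readings for two instances, not real measurements.
    total, fraction = combined_fraction([4000.0, 4000.0], peak)
    print(f"peak {peak:.0f} MB/s, combined {total:.0f} MB/s ({fraction:.0%} of peak)")
```

The same formula gives the per-channel ratings quoted for DDR3-1600 (12.8 GB/s) and DDR3-1333 (about 10.6 GB/s).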
Intel Xeon E5-2690, rated at 2.9 GHz with Turbo Boost speed 3.8 GHz, with
two sticks of DDR3 1600 MHz RAM (rated 12.8 GB/s/channel but not using dual-channel mode), running 64-bit Linux.
Intel Core i5-2520M, rated at 2.5 GHz but running at the Turbo Boost speed 3.2 GHz, with
two sticks of DDR3 1333 MHz RAM (rated 10.6 GB/s/channel), running 64-bit Ubuntu 10.
Intel Core i5-520M
at 2.4 GHz with 3MB L2 cache, running Mac OS/X Snow Leopard, 64-bit routines:
Intel Celeron at 2.8 GHz with 128kB L2 cache and PC2700 DDR memory
running 32-bit Linux 2.6:
Not using the nice command:
Using nice -n -2:
Intel Core 2 Duo P8600
at 2.4 GHz with 3MB L2 cache, running Mac OS/X Snow Leopard, 64-bit routines:
Intel Core i5-540M
at 2.53 to 3.07 GHz with 3MB L3 cache, running 64-bit Linux:
Intel Core i3-330M
at 2.16 GHz with 3MB L3 cache, running 64-bit Linux:
Intel Pentium T4300
at 2.1 GHz with 1MB L2 cache, running 64-bit Linux:
Intel Core 2 Duo T5600
at 1.83 GHz, with 2MB L2 cache, in a Mac Mini:
Intel Core 2 Duo T5200
at 1.6 GHz, with 2MB L2 cache, running Linux 64-bit:
Intel Core 2 Duo E8400
at 3 GHz, with 6 MB L2 cache:
OMAP 3530 Cortex A8 in a Beagle Board running 32-bit Linux 2.6.29 at 720 MHz (turbo mode):
bandwidth:
Here is another loopback test but for a 2.8 GHz Celeron running 32-bit Linux:
And here is the Mac communicating with the Linux box over Wifi:
Use nice -n -2 when running bandwidth.
The kernel may attempt to throttle the process otherwise.
On Intel Linux, you need a copy of NASM and the GCC suite. A decent distro will supply these. Simply type "make bandwidth32" or "make bandwidth64" to produce the Intel executables.
Lastly, to compile for 32-bit desktop Windows you need the GCC toolchain in the form of Cygwin. Type "make bandwidth32". The executable will not run outside of Cygwin because it requires a Cygwin DLL.
bandwidth currently does not
provide. Since I don't at present have a computer that supports
AVX there will be a delay before I can perfect
the AVX aspects of bandwidth,
specifically until after I upgrade.
For more information on the issues related to using AVX, see
this Intel document.
ganged mode?
dual-channel RAM operation. This means that your two DIMMs are managed together, providing the CPU with what appears to be a single 128-bit wide memory device.
Whether you are using dual-channel mode depends not only on your motherboard and chipset, but also on whether your BIOS is configured for it.
The default BIOS setting for this, referred to as the DCT or DCTs feature, is often unganged, i.e. the two memory sticks are not acting together.
What is a DCT? This refers to a DRAM Controller. The fact that in unganged mode each channel is independent means that there is a need for a DCT for each channel. A motherboard and chipset supporting a wide path to your RAM will likely provide as many DCTs as there are channels, as needed for unganged mode.
In the BIOS settings you will either see a simple selection for ganged versus unganged mode, or it may refer to which actual DRAM controllers are assigned to which channels, e.g. DCT0 (first DRAM controller) goes to channel A and DCT1 (second controller) manages channel B.
If your computer doesn't have an old-style PC BIOS but rather uses UEFI like Apple devices do, you may not have the option to alter the ganged/unganged setting. So consumers are disempowered firstly in that settings may not be accessible, but secondly in that the details of how UEFI works are actually proprietary and subject to an NDA. Therefore: UEFI is bad for consumers.
Q: Does ganged mode actually improve speed?
A: People say it generally neither improves nor reduces speed, which is why it is not enabled by default. Ganged and unganged offer about the same performance for any given application running on one core.
Q: If unganged mode requires more silicon (one DCT per channel) but has the same performance as ganged mode, then why not enable ganged by default and remove the extra DRAM controllers?
A: Because unganged mode offers more flexibility in letting multiple cores and hyperthreads access different areas of memory at once.
Q: How can maximum performance be achieved realistically?
A: With the help of the OS, a program could allocate its in-memory data sets across the DIMMs (obviously this is a virtual memory issue) to avoid the bottleneck of all of its data going through just one channel.
bandwidth graph from the i5-2520M where for some reason,
loading 128-bit values from the L1 cache sequentially into the XMM
registers was running at an astounding 96 GB per second.
Writing into L1 was much slower.
Here is that graph:
After a few calculations, it became clear why this was happening:
1. 96 GB per second at the processor's peak speed of 3.2 GHz means a transfer rate of roughly 32 bytes, i.e. 256 bits, per cycle. But the XMM registers are only 128 bits wide...
2. Newer Intel CPUs, however, have YMM registers, used by Advanced Vector Extensions (AVX) instructions. These are 256 bits wide: each of the 16 existing XMM registers is extended by an additional 128 bits.
3. Note, however, that my test is not loading YMM registers; it's loading XMM. What the microcode is doing is transferring 128 bits per cycle to each of two separate registers, which in a straightforward hardware design would share the same input wires.
Therefore, Intel has designed the circuitry so that two XMM registers can be loaded in one cycle. It seems likely that either the L1-to-XMM/YMM path has been expanded to 256 bits or reading from L1 is possible at a rate of two per cycle (dual data rate). The latter seems more likely.
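The per-cycle arithmetic behind this observation can be checked with a short Python sketch. The function name is hypothetical, and the result depends on whether the graph's "GB" means 10**9 or 2**30 bytes, which is an assumption either way, so both readings are shown.

```python
def bytes_per_cycle(bytes_per_second, clock_hz):
    """Average number of bytes moved in each clock cycle."""
    return bytes_per_second / clock_hz


CLOCK_HZ = 3.2e9  # Turbo Boost speed of the i5-2520M
XMM_BYTES = 16    # one 128-bit XMM register

decimal = bytes_per_cycle(96 * 10**9, CLOCK_HZ)  # GB taken as 10**9 bytes
binary = bytes_per_cycle(96 * 2**30, CLOCK_HZ)   # GB taken as 2**30 bytes

if __name__ == "__main__":
    for label, value in (("decimal GB", decimal), ("binary GB", binary)):
        print(f"{label}: {value:.1f} bytes/cycle, {value / XMM_BYTES:.2f} XMM loads/cycle")
```

Under the decimal reading the rate is about 30 bytes per cycle and under the binary reading about 32, so in both cases it is close to two full XMM loads per cycle rather than one.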
bandwidth or
for instance the genetics program BLAST, the amount of L3 that each core
can effectively use will be its fraction of the total. If Y is the
number of megabytes of L3, this would be Y/X.
This was proven to me recently when a person with a dual-CPU Xeon 2690
system (20 MB L3, 8 cores and 4 channels per CPU)
ran bandwidth on 8 cores out of 16, resulting in
each core effectively having only 5 MB of L3.
If he had been running bandwidth on all cores, obviously
each would effectively have only 2.5 MB of L3 to use.
If one were to use an AMD Opteron with 10 cores and 30 MB of shared L3, the worst case situation would be each core having effectively 3 MB of L3, which is the same as a lowly consumer-grade Core i5.
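The Y/X estimate is simple enough to express as a small Python helper. The function name is hypothetical, and the Xeon case assumes the 8 busy cores were split evenly across the two CPUs, 4 per CPU, which is what the 5 MB figure implies.

```python
def l3_per_core_mb(l3_mb, busy_cores):
    """Effective share of a shared L3 when every busy core competes for it."""
    if busy_cores < 1:
        raise ValueError("at least one core must be running the workload")
    return l3_mb / busy_cores


if __name__ == "__main__":
    # Xeon 2690: 20 MB of L3 per CPU, 4 busy cores per CPU (assumed even split)
    print(l3_per_core_mb(20, 4))
    # Same CPU with all 8 of its cores busy
    print(l3_per_core_mb(20, 8))
    # Opteron example: 30 MB of shared L3, 10 busy cores
    print(l3_per_core_mb(30, 10))
```

This treats the cache as being divided equally, which is a worst-case simplification; real sharing depends on each program's access pattern.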
Thus with the Xeon and Opteron used with a memory-intensive application running on each core, perhaps one's priority should be on: