© 2015-2018 by Zack Smith. All rights reserved.
This page describes my experiment that compares the performance of two x86_64 assembly language routines that I wrote that perform 256-bit division i.e.
bignum meaning big-number division.
1. Division is CPU intensive
Big-number division is processor-intensive and time-consuming, moreso than any processor's built-in division instructions. One common use for bignum division is verifying the primeness of huge number in cryptography.
The present experiment is not about factoring primes. In this experiment I focus solely on division, with the goal of differentiating the performance of CPU-intensive operations wherein the data being worked on is entirely in registers versus where it is entirely in memory.
2. Optimization on x86
It is generally known that storing data in registers allows faster access to said data, therefore work done on that data is faster.
Data stored in registers are affected by speculative execution, whereas stack data are not.
But to what degree are operations on register data faster than on-stack? My benchmark
bandwidth showed that on my Core i5, register-to-register move instructions (MOV) are 4 times faster than register-to-stack MOVs. They are also twice as fast as stack-to-register MOVs. However these are move instructions. They don't compete for access to the ALUs and they don't depend on one another's results.
The case of 256-bit division should provide the best-case scenario to demonstrate any benefit that can be achieved by storing interdependent data in registers. The 64-bit register set of the x86 class of CPUs has enough room for a 256-bit divisor, 256-bit dividend, 256-bit result, and loop counter.
Analysis of the algorithm
When you learn base-10 long division in school, it becomes clear that long division is laborious. In base-10 division one has to multiply the divisor by a value from 1 to 9, and the correct muliplier has to be determined by trial and error. Binary division is somewhat simpler, as one does not need to multiply the divisor by anything.
The basic algorithm is as follows. Let us call the number of bits n.
- The divisor and dividend must be copied into registers if they will be stored there. The overall complexity of this is O(n).
- Determine where the leftmost bits of the dividend and divisor are located; this is O(n). All n bits of the divisor must then be shifted left to meet the leftmost bit of the dividend. This can require up to n-1 shifts (I have optimized my code to do a bytewise shift first.) Since all n bits must be shifted, the complexity of this is O(n2). (This can be sped up by moving bigger chunks, then bytes, then shifting bits.)
- The core loop is as follows. This loop might be run up to n times. The divisor must be subtracted from the dividend. If there is a borrow result, meaning that the divisor was larger than the dividend, we must then undo our subtraction. After each subtraction, the divisor must be shifted right one bit (O(n)), and the result must be shifted left one bit (O(n)). If there was no borrow during the subtraction, we set the lowest bit of the (shifted) result. The overall complexity of this is O(n2).
- The result must be copied to the caller's buffer, which is linear time i.e. O(n).
In finer detail:
- n/64 quadword copies to fetch dividend.
- n/64 quadword copies to fetch divisor.
- Up to n-1 divisor left-shift operations; each affects n/64 quadwords.
- Up to n core loop operations:
- n/64 quadword subtractions
- Sometimes n/64 quadword copies to undo subtractions
- n/64 quadword shifts for the divisor
- n/64 quadword shifts for the result
- 1 quadword counter decrement
- n/64 quadword copies to store the result.
To summarize, based on this analysis the performance overall should be O(n2), where n is the number of bits.
The more that the data being worked on (dividend, divisor, result) can be put into registers, the faster it should run.
I predict 256-bit division in registers will be roughly twice as fast as in memory.
From my benchmark
bandwidth the difference in speed between register-to-register moves and moves between registers and the stack is a factor of 2 to 4. But does this hold for the division algorithm? We shall see.
I wrote two x86 64-bit assembly routines:
FWIW, the latter is destructive of the caller's dividend and divisor, whereas the former is not. This should not affect the results because if I were to save/restore the dividend and divisor, it would be O(n) time which is eclipsed by division's O(n2).
I fed divisors of varying sizes into each routine but I used the 128-bit divisor as the key measurement.
|n||divisions per second|
The ratio of registers over memory is 2.81, which is within the range of my 2X to 4X prediction. Hypothesis confirmed.
Division operations require performing an operation such as a subtraction or shift, which is immediately followed by a similar operation that uses the preceding operation's result. In this way the big numbers are internally interdependent. This fact prevents reordering by the CPU to improve execution speed.
The results of the experiment confirm that having interdependent data B(in registers) results in better execution times than having them on in DRAM.
I would note that this experiment is limited to 256 bit (and smaller) numbers because larger numbers do not fit entirely in the registers.
I've subsequently extended the in-memory routine to support any number of bits per operand. I set up the Makefile and tests to generate divisions in these sizes: . itemize()
- 65536-bit . itemize_() Thus the size quadruples in each step. So unsurprisingly for a O(n2) algorithm, the performance drops by roughly a factor of 16 for each step.
|1024-bit (memory)||512||46240||decreases to 1/14.2|
|4096-bit (memory)||2048||2662||decreases to 1/17.4|
|16384-bit (memory)||8192||161.9||decreases to 1/16.4|
|65536-bit (memory)||32768||9.890||decreases to 1/16.4|
The any-size routine does support some in-register book-keeping, but it is minor. It may account for the 1/14.2 decrease because the 256-bit is purely in-memory.