Performance of 256-bit, 512-bit, 1024-bit and 2048-bit unsigned division on 64-bit Intel x86
Revision 6
© Zack Smith. All rights reserved.
Diviner

This page describes my experiment comparing 256-bit, 512-bit, 1024-bit, and 2048-bit division performance. It uses my big-integer unsigned division routines, which I coded in 64-bit x86 assembly language and which collectively comprise my benchmark, called Diviner.
Observations

1. Division is CPU intensive

Big-integer division is processor-intensive and time-consuming, far more so than any processor's divide instructions. One common use for division is the factoring of large numbers into primes.

In 2002 a private company claimed that 512-bit encryption keys were factorable in about 6 weeks. This presumably involved a large-scale parallel-processing effort performing billions of divisions, plus complex mathematical models to reduce the workload (Geek.com article). In 2015, Matthew Green provided a new estimate for 512-bit factoring: 7.5 hours, using Amazon EC2 (Matthew Green blog).

The present experiment is not about factoring primes. In this experiment I focus solely on division, and I use only my humble 2.6 GHz Core i5 Haswell laptop computer.

2. Optimization on x86

It is generally known that storing data in registers allows faster access to that data, and therefore faster performance of the work done on it. Data stored in registers are affected by speculative execution, whereas stack data are not. But to what degree are operations on register data faster than operations on stack data? And in what contexts? And what about vector registers, which can be used as temporary storage?
My benchmark

The case of 256-bit division should provide the best-case scenario for demonstrating what benefit, if any, can be achieved by storing interdependent data in registers. The 64-bit register set of the x86 class of CPUs has enough room for a 256-bit divisor, a 256-bit dividend, a 256-bit result, and a loop counter.

The register set has almost enough room for two 512-bit values, because it has 15 available 64-bit registers (RSP being off-limits). My 512-bit division routine stores the entire dividend and most of the divisor in registers. My 1024-bit division routine can store only most of the dividend in registers, and none of the divisor or result.

Analysis of the algorithm

When you learn base-10 long division in school, it becomes clear that long division is laborious. In base-10 division one has to multiply the divisor by a value from 1 to 9, and the correct multiplier has to be determined by trial and error. Binary division is somewhat simpler, as one does not need to multiply the divisor by anything.

The basic algorithm is as follows. Let us call the number of bits n:

1. The divisor and dividend must be copied into registers and/or onto the stack. The overall complexity of this is O(n).

2. Determine where the leftmost set bits of the dividend and divisor are located; this is O(n). All n bits of the divisor must then be shifted left to align with the leftmost bit of the dividend. This can require up to n-1 shifts (I have optimized my code to do a bytewise shift first). Since all n bits must be shifted each time, the complexity of this is O(n²).

3. The core loop is as follows; it might run up to n times. The divisor is subtracted from the dividend. If there is a borrow result, meaning that the divisor was larger than the dividend, we must then undo our subtraction. After each subtraction, the divisor must be shifted right one bit (O(n)), and the result must be shifted left one bit (O(n)). If there was no borrow during the subtraction, we set the lowest bit of the (shifted) result. The overall complexity of this is O(n²).

4. The result must be copied to the caller's buffer, which is linear time, i.e. O(n).
To summarize, based on this analysis the overall performance should be O(n²), where n is the number of bits.

Hypothesis

The fact that the above division algorithm is O(n²) is not in dispute. It is actually O(n²) in two places: the initial left-shift of the divisor, and the more substantial division loop. As I increase the value of n, I should see a drop in performance consistent with the O(n²) complexity rating: for instance, a doubling of n should cut the divisions-per-second rate to 1/4 of its previous value or less. Again, that is not the interesting part.
What is untested is whether this maxim is universally true:
The more of the data being worked on that can be kept in registers, the faster the code should run.
To determine the efficacy of using register variables over stack variables, I'll use the specific comparison of the four routines from my benchmark.
Predictions

I predict that 256-bit division will be at least 4 times faster than 512-bit, that 512-bit division will be at least 4 times faster than 1024-bit, and that 1024-bit division will be at least 4 times faster than 2048-bit.

It is impossible to predict exactly how much faster than 4X any routine will be compared with the next larger size. It is simply unknown, and some CPUs may do better than others. In theory the advantage beyond 4X should shrink as n increases, because less and less of the data can fit into the 64-bit registers.

Methods and results

I wrote an x86-64 assembly routine called divide512, and converted it to support 256-bit, 1024-bit and 2048-bit unsigned division.
For this experiment I used dividends of size n bits, and divisors of size n/2 bits. In actuality, Diviner is able to vary the size of the divisor from 8 to n and the divisions-per-second vary greatly within that range. The n/2 size is the middle of the range. I improved several of the routines to use the XMM registers as temporary storage, specifically to undo the subtraction operation. This improved performance in each case. Results were as follows:
The ratio from 256 to 512 is 4.71.
In 2 out of 3 cases, the routine with more data in registers performed somewhat better, relative to what O(n²) complexity would suggest, than the next-larger-integer routine with less data in registers, but not to a dramatic degree.

Discussion

Division requires performing an operation such as a subtraction or shift that is immediately followed by a similar operation using the preceding operation's result. In this way the data are interdependent. This fact prevents much of the instruction reordering the CPU could otherwise use to improve execution speed.

The results of the experiment confirm that keeping interdependent data in registers yields better execution times than keeping them on the stack. But the speed-up due to register variables was not huge, no doubt in large part because each individual operation (addition, subtraction, shift) usually relies on the result of the immediately preceding operation of the same type. This guarantees poor performance. I was sometimes able to interlace ALU operations with XMM move operations, which helped performance by letting the CPU do two operations at once, if it chose to. This helped to a greater degree in the larger-integer routines.

Download

It's here.