
    Performance of 256-bit, 512-bit, 1024-bit and 2048-bit unsigned division on 64-bit Intel x86

    Revision 6
    © by Zack Smith
    All rights reserved.

    Diviner

    This page describes my experiment that compares 256-bit, 512-bit, 1024-bit, and 2048-bit division performance. It uses my big-integer unsigned division routines, which I coded in 64-bit x86 assembly language, and which collectively comprise my benchmark called Diviner.

    Observations

    1. Division is CPU intensive

    Big-integer division is processor-intensive and time-consuming, far more so than any single hardware divide instruction. One common use for big-integer division is the factoring of large numbers into primes.

    In 2002 a private company claimed that 512-bit encryption keys could be factored in about 6 weeks. This presumably involved a large-scale parallel-processing effort performing billions of divisions and using complex mathematical models to reduce the workload. Geek.com article.

    In 2015, Matt Green provided a new estimate of 512-bit factoring: 7.5 hours, using Amazon EC2. Matthew Green blog.

    The present experiment is not about factoring. I focus solely on division, and I use only my humble 2.6 GHz Core i5 (Haswell) laptop computer.

    2. Optimization on x86

    It is generally known that keeping data in registers allows faster access to that data, and therefore faster execution of whatever work is done on it.

    Data stored in registers are affected by speculative execution, whereas stack data are not.

    But to what degree are operations on register data faster than on-stack? And in what contexts? And what about vector registers, which can be used as temporary storage?

    My benchmark bandwidth shows that on my Core i5, register-to-register MOV instructions are 4 times faster than register-to-vector and vector-to-register MOVQ instructions, and 4 times faster than register-to-stack MOVs. They are also twice as fast as stack-to-register MOVs. However, these are plain move instructions: they do not compete for access to the ALUs, and they do not depend on one another's results.

    The case of 256-bit division should provide the best-case scenario to demonstrate what if any benefit can be achieved by storing interdependent data in registers. The 64-bit register set of the x86 class of CPUs has enough room for a 256-bit divisor, 256-bit dividend, 256-bit result, and loop counter.

    The register set has almost enough room for two 512-bit values, because it has 15 available 64-bit registers (RSP being off-limits). My 512-bit division routine stores the entire dividend and most of the divisor in the registers.

    For 1024-bit division, only most of the dividend can be kept in the registers, and none of the divisor or result; my 1024-bit routine accordingly keeps most of the dividend in registers and everything else on the stack.
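    The register budget works out as follows (my arithmetic, not from the original text: each value needs n/64 quadwords, 15 general-purpose registers are usable, and in the 256-bit case one register holds the loop counter):

    \begin{aligned}
    \text{256-bit:}\;& 4 + 4 + 4 + 1 = 13 \le 15 \quad\text{(dividend + divisor + result + counter all fit)}\\
    \text{512-bit:}\;& 8 + 8 = 16 > 15 \quad\text{(dividend + divisor are one quadword short of fitting)}\\
    \text{1024-bit:}\;& \text{the dividend alone is 16 quadwords; at most } \tfrac{15}{16} \approx 94\% \text{ fits}\\
    \text{2048-bit:}\;& \text{the dividend alone is 32 quadwords; at most } \tfrac{15}{32} \approx 47\% \text{ fits}
    \end{aligned}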

    Analysis of the algorithm

    When you learn base-10 long division in school, it becomes clear that long division is laborious. In base-10 division one has to multiply the divisor by a value from 1 to 9, and the correct multiplier has to be found by trial and error. Binary division is somewhat simpler, as one does not need to multiply the divisor by anything.

    The basic algorithm is as follows. Let us call the number of bits n:

    1. The divisor and dividend must be copied into registers and/or onto the stack. The overall complexity of this is O(n).

    2. Determine where the leftmost bits of the dividend and divisor are located; this is O(n). All n bits of the divisor must then be shifted left to meet the leftmost bit of the dividend. This can require up to n-1 shifts (I have optimized my code to do a bytewise shift first). Since all n bits must be shifted, the complexity of this is O(n²).

    3. The core loop is as follows. This loop might be run up to n times. The divisor must be subtracted from the dividend. If there is a borrow result, meaning that the divisor was larger than the dividend, we must then undo our subtraction. After each subtraction, the divisor must be shifted right one bit (O(n)), and the result must be shifted left one bit (O(n)). If there was no borrow during the subtraction, we set the lowest bit of the (shifted) result. The overall complexity of this is O(n²).

    4. The result must be copied to the caller's buffer, which is linear time i.e. O(n).

    In finer detail:

    • n/64 quadword copies to fetch dividend.
    • n/64 quadword copies to fetch divisor.
    • Up to n-1 divisor left-shift operations; each affects n/64 quadwords.
    • Up to n core loop operations:
      • n/64 quadword subtractions
      • Sometimes n/64 quadword copies to undo subtractions
      • n/64 quadword shifts for the divisor
      • n/64 quadword shifts for the result
      • 1 quadword counter decrement
    • n/64 quadword copies to store the result.

    To summarize, based on this analysis the overall running time should be O(n²), where n is the number of bits.
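    To make the four steps concrete, here is a minimal C sketch of the same shift-and-subtract (restoring) algorithm, operating on little-endian arrays of 64-bit limbs. It is not the Diviner assembly: the register allocation, the bytewise pre-shift, and the XMM tricks discussed below are all omitted, and the limb count is a compile-time constant that can be set to 4, 8, 16 or 32 for 256, 512, 1024 or 2048 bits.

    /* Minimal C sketch of the shift-and-subtract (restoring) algorithm above.
       NOT the Diviner assembly; illustration only.                           */
    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define LIMBS 4                      /* 4 x 64 = 256 bits */

    /* Index of the highest set bit, or -1 if the value is zero: O(n). */
    static int top_bit (const uint64_t *a) {
        for (int i = LIMBS-1; i >= 0; i--)
            if (a[i])
                for (int b = 63; b >= 0; b--)
                    if ((a[i] >> b) & 1)
                        return i*64 + b;
        return -1;
    }

    /* Shift left or right by one bit: O(n) each. */
    static void shl1 (uint64_t *a) {
        for (int i = LIMBS-1; i > 0; i--)
            a[i] = (a[i] << 1) | (a[i-1] >> 63);
        a[0] <<= 1;
    }
    static void shr1 (uint64_t *a) {
        for (int i = 0; i < LIMBS-1; i++)
            a[i] = (a[i] >> 1) | (a[i+1] << 63);
        a[LIMBS-1] >>= 1;
    }

    /* a -= b; returns true if a borrow occurred (b was larger): O(n). */
    static bool sub (uint64_t *a, const uint64_t *b) {
        unsigned borrow = 0;
        for (int i = 0; i < LIMBS; i++) {
            uint64_t bi = b[i] + borrow;
            borrow = (bi < b[i]) || (a[i] < bi);
            a[i] -= bi;
        }
        return borrow;
    }

    /* quotient = dividend / divisor; divisor must be nonzero. */
    void divide (uint64_t *quotient, const uint64_t *dividend_in, const uint64_t *divisor_in) {
        uint64_t dividend[LIMBS], divisor[LIMBS], saved[LIMBS];
        memcpy (dividend, dividend_in, sizeof dividend);     /* step 1: O(n) copies */
        memcpy (divisor, divisor_in, sizeof divisor);
        memset (quotient, 0, LIMBS * sizeof *quotient);

        /* Step 2: align the divisor's top bit with the dividend's (up to n-1 shifts). */
        int shift = top_bit (dividend) - top_bit (divisor);
        if (shift < 0)
            return;                       /* dividend < divisor: quotient is 0 */
        for (int i = 0; i < shift; i++)
            shl1 (divisor);

        /* Step 3: core loop, run up to n times. */
        for (int i = 0; i <= shift; i++) {
            memcpy (saved, dividend, sizeof saved);     /* backup for a possible undo */
            bool borrow = sub (dividend, divisor);
            if (borrow)
                memcpy (dividend, saved, sizeof saved); /* undo the subtraction */
            shl1 (quotient);
            if (!borrow)
                quotient[0] |= 1;                       /* this quotient bit is 1 */
            shr1 (divisor);
        }
        /* Step 4 would copy the result to the caller's buffer; here the quotient
           was written there directly, and the remainder is left in dividend[]. */
    }

    The restore-on-borrow copy in the core loop is the step that the XMM scratch registers help with in the real routines, as described under Methods below.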

    Hypothesis

    The fact that the above division algorithm is O(n²) is not in dispute. It's actually O(n²) in two places: the original left-shift of the divisor, and the more substantial division loop.

    As I increase the value of n, I should see a drop in performance consistent with the O(n²) complexity. For instance, doubling n should cut throughput to roughly one quarter, i.e. a slowdown of at least 4×. Again, that is not the interesting part.

    What is untested is whether this maxim is universally true: the more of the working data that can be kept in registers, the faster the code should run. My benchmark bandwidth shows the speed difference between register accesses and top-of-stack data (which usually sits in the L1 cache) is a factor of 2 to 4. But does this hold for the division algorithm? We shall see.

    To determine the efficacy of using register variables over stack variables, I'll use the specific comparison of my four routines divide256, divide512, divide1024, and divide2048.

    • divide256 keeps the dividend, divisor, result and loop counter entirely in the registers.
    • divide512 keeps the dividend and divisor almost entirely in the registers but none of the result.
    • divide1024 keeps only most of the dividend (94%) in the registers and everything else on the stack.
    • divide2048 keeps less than half of the dividend (47%) in the registers and everything else on the stack.

    Predictions

    I predict 256-bit division will be at least 4 times faster than 512-bit, 512-bit division will be at least 4 times faster than 1024-bit, and 1024-bit division will be at least 4 times faster than 2048-bit.

    It is impossible to predict exactly how far beyond 4× each routine will outperform the next larger size; the margin is unknown, and it may differ from one CPU to another.

    In theory there should be less and less of an advantage beyond 4X as n increases, because less and less of the data can fit into the 64-bit registers.
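    Stated as a formula (my notation, not part of the original analysis): if the time for one n-bit division is roughly

    \[
    t(n) \approx c(n)\,n^2, \qquad\text{then}\qquad
    \frac{\text{divisions/sec at } n}{\text{divisions/sec at } 2n}
      \;=\; \frac{t(2n)}{t(n)} \;\approx\; 4\cdot\frac{c(2n)}{c(n)},
    \]

    where c(n) is a per-bit cost factor that grows as more of the working set spills from registers to the stack. Any measured ratio above 4 is therefore attributed to the register-versus-stack difference, and because c(n) levels off once nearly everything lives on the stack, the excess above 4 should shrink as n grows.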

    Methods and results

    I wrote an x86 64-bit assembly routine called divide512, and converted that to support 256-bit, 1024-bit and 2048-bit unsigned division.

    For this experiment I used dividends of size n bits and divisors of size n/2 bits. In actuality, Diviner can vary the divisor size anywhere from 8 bits to n bits, and the divisions-per-second figure varies greatly across that range; n/2 is the middle of that range.

    I improved several of the routines to use the XMM registers as temporary storage, specifically to undo the subtraction operation. This improved performance in each case.
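    As a rough illustration of the idea (rendered in C with SSE intrinsics rather than the actual assembly, and with a 256-bit width and function name of my own choosing): the dividend quadwords are parked in XMM registers before the trial subtraction, so undoing a borrow requires no stack traffic.

    #include <immintrin.h>   /* _mm_set_epi64x, _mm_cvtsi128_si64, _mm_extract_epi64 (SSE4.1) */
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical 256-bit trial subtraction, d -= s, restoring d from XMM
       registers if a borrow occurs.  The real routines do this in assembly,
       with MOVQ instructions interleaved among the SUB/SBB chain.           */
    static bool trial_subtract_256 (uint64_t d[4], const uint64_t s[4]) {
        __m128i save01 = _mm_set_epi64x ((long long)d[1], (long long)d[0]);
        __m128i save23 = _mm_set_epi64x ((long long)d[3], (long long)d[2]);

        unsigned borrow = 0;
        for (int i = 0; i < 4; i++) {
            uint64_t si = s[i] + borrow;
            borrow = (si < s[i]) || (d[i] < si);
            d[i] -= si;
        }

        if (borrow) {                     /* divisor was larger: undo from XMM */
            d[0] = (uint64_t)_mm_cvtsi128_si64 (save01);
            d[1] = (uint64_t)_mm_extract_epi64 (save01, 1);
            d[2] = (uint64_t)_mm_cvtsi128_si64 (save23);
            d[3] = (uint64_t)_mm_extract_epi64 (save23, 1);
        }
        return !borrow;                   /* true: this quotient bit is 1 */
    }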

    Results were as follows:

    n (bits)    divisions per second
     256             1,130,000
     512               240,000
    1024                56,800
    2048                11,700

    The ratio from 256 to 512 is 4.71.
    The ratio from 512 to 1024 is 4.23.
    The ratio from 1024 to 2048 is 4.85.
    The ratio from 256 to 2048 is 96.6 i.e. two orders of magnitude.
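    These ratios are simply successive rows of the table divided by one another; a trivial check in C, using the figures above:

    #include <stdio.h>

    int main (void) {
        double dps[] = { 1130000, 240000, 56800, 11700 };   /* 256, 512, 1024, 2048 bits */
        for (int i = 0; i < 3; i++)
            printf ("%4d -> %4d bits: %.2f\n", 256 << i, 512 << i, dps[i] / dps[i+1]);
        printf (" 256 -> 2048 bits: %.1f\n", dps[0] / dps[3]);
        return 0;
    }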

    In two of the three adjacent comparisons (256 vs 512, and 1024 vs 2048), the ratio was well above 4: the routine with more of its data in registers performed somewhat better, relative to the next larger size with less data in registers, than the O(n²) complexity alone would suggest, but not to a dramatic degree.

    Discussion

    Division requires performing an operation such as a subtraction or a shift that is immediately followed by a similar operation using the preceding operation's result. In this way the data are interdependent, which prevents the CPU from doing much instruction reordering to improve execution speed.

    The results of the experiment confirm that keeping interdependent data in registers yields better execution times than keeping them on the stack. But the speed-up due to register variables was not huge, no doubt in large part because each individual operation (addition, subtraction, shift) usually relies on the result of the immediately preceding operation of the same type, which by itself guarantees poor performance.

    I was sometimes able to interleave ALU operations with XMM move operations, which improved performance by letting the CPU execute two operations at once when it chose to. This helped more in the larger-integer routines.

    Download

    It's here.



    © Zack Smith