Bandwidth benchmark TouchWidgets UI lib Diviner big number math x86 instructions ref GIT quick ref GPG quick ref Avoid Ubuntu Android malware risks iOS malware risks OS/X security tips Who blocks Tor Software engineering BASH aliases I.B. pro/con Nutrition Other apps Blog
| Performance of 256-bit division when numbers are in-register vs. in-memory
Revision 7
© by Zack Smith All rights reserved.
## DivinerThis page describes my experiment that compares the performance of two x86_64 assembly language routines that I wrote that perform 256-bit division i.e.bignummeaning big-number division. ## An Experiment## Observations## 1. Division is CPU intensiveBig-number division is processor-intensive and time-consuming, moreso than any processor's built-in division instructions. One common use for bignum division is verifying the primeness of huge number in cryptography.The present experiment is not about factoring primes. In this experiment I focus solely on division, with the goal of differentiating the performance of CPU-intensive operations wherein the data being worked on is entirely in registers versus where it is entirely in memory. ## 2. Optimization on x86It is generally known that storing data in registers allows faster access to said data, therefore work done on that data is faster.Data stored in registers are affected by speculative execution, whereas stack data are not.
But to what degree are operations on register data faster than on-stack?
My benchmark
The case of 256-bit division should provide the best-case scenario to demonstrate any benefit that can be achieved
by storing ## Analysis of the algorithmWhen you learn base-10 long division in school, it becomes clear that long division is laborious. In base-10 division one has to multiply the divisor by a value from 1 to 9, and the correct muliplier has to be determined by trial and error. Binary division is somewhat simpler, as one does not need to multiply the divisor by anything.
The basic algorithm is as follows.
Let us call the number of bits 1. The divisor and dividend must be copied into registers if they will be stored there. The overall complexity of this is O(n).
2.
Determine where the leftmost bits of the dividend and divisor
are located; this is O(n).
All n bits of the divisor must then be shifted left to meet the leftmost bit of the dividend.
This can require up to n-1 shifts (I have optimized my code to do a bytewise shift first.)
Since all n bits must be shifted,
the complexity of this is O(n
3. The core loop is as follows.
This loop might be run up to n times.
The divisor must be subtracted from the dividend.
If there is a borrow result, meaning that the divisor was larger than the dividend,
we must then undo our subtraction.
After each subtraction, the divisor must be shifted right one bit (O(n)),
and the result must be shifted left one bit (O(n)).
If there was no borrow during the subtraction,
we set the lowest bit of the (shifted) result.
The overall complexity of this is O(n 4. The result must be copied to the caller's buffer, which is linear time i.e. O(n). In finer detail: - n/64 quadword copies to fetch dividend.
- n/64 quadword copies to fetch divisor.
- Up to n-1 divisor left-shift operations; each affects n/64 quadwords.
- Up to n core loop operations:
- n/64 quadword subtractions
- Sometimes n/64 quadword copies to undo subtractions
- n/64 quadword shifts for the divisor
- n/64 quadword shifts for the result
- 1 quadword counter decrement
- n/64 quadword copies to store the result.
To summarize,
based on this analysis the performance overall should be O(n ## HypothesisThe more that the data being worked on (dividend, divisor, result) can beput into registers, the faster it should run.
## PredictionsI predict 256-bit division in registers will be roughly twice as fast as in memory.
From my benchmark ## MethodsI wrote two x86 64-bit assembly routines:- divide256_registers
- divide256_memory
^{2}).
I fed divisors of varying sizes into each routine but I used the 128-bit divisor as the key measurement. ## Results
The ratio of registers over memory is 2.81, which is within the range of my 2X to 4X prediction. Hypothesis confirmed. ## DiscussionDivision operations require performing an operation such as a subtraction or shift, which is immediately followed by a similar operation that uses the preceding operation's result. In this way the big numbers are internally interdependent. This fact prevents reordering by the CPU to improve execution speed.
The results of the experiment
confirm that having interdependent data I would note that this experiment is limited to 256 bit (and smaller) numbers because larger numbers do not fit entirely in the registers. ## AddendumI've subsequently extended the in-memory routine to support any number of bits per operand. I set up the Makefile and tests to generate divisions in these sizes:- 1024-bit
- 4096-bit
- 16384-bit
- 65536-bit
^{2}) algorithm, the performance
drops by roughly a factor of 16 for each step.
The any-size routine does support some in-register book-keeping, but it is minor. It may account for the 1/14.2 decrease because the 256-bit is purely in-memory. ## DownloadIt's here. |

© Zack Smith