Copyright © 2012-2013 by Zack Smith.
All rights reserved.
IntroductionAfter considering the task of generating prime numbers I hit upon an idea for optimizing the process on standard 32- and 64-bit Intel based CPUs.
Two sievesLet's compare two sieves, which generate a list of prime numbers: The Sieve of Eratosthenes versus mine.
Sieve of EratosthenesIn the Sieve of Eratosthenes, one first creates a bitmap representing candidate numbers, i.e. those whose primeness is to be determined. Suppose one wants to check 100 candidates for primeness, e.g. 2 to 101. This requires a bitmap of 100 bits. Given the primes that we already know, we flag all multiples of those primes within the array e.g. 2, 4, 6, 8... and 3, 6, 9, 12.... with 1 bits, indicating that these numbers are non-prime.
What's left is a set of unflagged numbers that we need to test for primeness by calling a generic is_prime() function. If we learn of a new prime number, we must flag the numbers in the array that ar multiples of it, e.g. 2n, 3n, 4n... to remove those from consideration. We leave the actual prime number (1n) unflagged.
So we proceed through the array, doing a primeness test on each non-flagged number. If it's prime or non-prime, we update the array. When we reach the end of the array, we know that the remaining non-flagged numbers are in fact primes.
The Sieve of Eratosthenes simply lets you avoid factoring the multiples of known primes, which is a large percentage of the numbers in the array, thereby saving you time.
One disadvantage of Eratosthenes is that you have to allocate a fixed-length array at the start. Even though we are using a bit array, if you want to determine all of the primes among the first 4 billion numbers, you'd need an array that is 0.5 billion bytes and this becomes a resource-hungry endeavor.
Eratosthenes is ultimately limited by the speed of your main memory, because the array is typically far too large to fit in your CPU's caches. If you plan to allocate a bitmap that is larger than your available RAM, you can expect a lot of paging to disk, which will impact greatly on the speed of the program.
Sieve of Smith (my sieve)My sieve takes some inspiration from the above sieve. Instead of eliminating every single multiple of every prime using a bitmap, I keep counters for the lower primes in the CPU registers. As one proceeds sequentially into higher numbers, I use the counters to skip over multiples of primes, rather than a bitmap.
I originally wrote this in 32-bit x86 assembly language, which provided enough 8-bit registers to skip over most of the multiples of lower primes. When I rewrote this in x86_64 assembly language, I had more registers to use and so the coverage is more complete. To repeat the basic theme: By eliminating multiples of the 12 lowest primes, the vast majority of numbers are removed from consideration and do not need to be checked for primeness.
The most time-consuming part of generating prime numbers is always the checking of a number for primeness by dividing known primes into it, because divisions are very compute-intensive. One's goal should always be to avoid doing divisions. By using counters, I can avoid doing most of those divisions.
I am also avoiding the memory-intensive approach of Eratosthenes as well.
The pertinent data for avoiding perhaps 95% of the divisions is held in the fastest memory available: The processor's registers.
The remaining necessary work consists mainly of divisions, done to determine the primeness of a candidate number, which requires dividing each unsigned integer in the array of prime numbers into the candidate number, until a zero remainder is found or not. As program execution continues, that array is always increasing in size because it is, after all, the output array.
Sieve of Smith also uses the common technique
Here is my pseudocode for the Sieve of Smith, checking against the lower 9 primes:
AnalysisMy sieve avoids testing most numbers for primeness, because it exploits the fact that multiples of the lower primes emcompass most candidate numbers, and certainly more than do multiples of higher primes.
PerformanceThese numbers assume using my enhanced is_prime function.
On my Core i5 2.4 GHz with DDR3 main memory, running a 32-bit variant of my sieve, 203 million primes were generated in 85 minutes using 32-bit divisions.
On my Core i5 2.4 GHz with DDR3 main memory, 100 million primes were generated in 3284 seconds, or about 55 minutes, using 64-bit divisions.
On an old Celeron 2.8 GHz with slow DDR main memory, 100 million primes required 4792 seconds, or about 80 minutes. This was running my 32-bit C code variant, compiled with gcc -O6, on GNU/Linux, kernel 2.6.
Areas for improvementIt is conceivable that division operations in the is_prime() routine could be made faster by using vector processing i.e. XMM or YMM registers. For instance, while multiples of the lower primes can be easily avoided with operations using the main register set, the next-higher known primes could be divided into the candidate number efficiently using the entire set of XMM registers very quickly. However the precision of this would be limited by the number of significant bits available per number in those registers.