zsmith.co
A Fast Method for Generating Primes
My iOS Apps

Open source

Documentation

Contact:
1 at zsmith dot co


Revision 5
Copyright © 2012-2013 by .
All rights reserved.

Introduction

After considering the task of generating prime numbers I hit upon an idea for optimizing the process on standard 32- and 64-bit Intel based CPUs.

Two sieves

Let's compare two sieves, which generate a list of prime numbers: The Sieve of Eratosthenes versus mine.

Sieve of Eratosthenes

In the Sieve of Eratosthenes, one first creates a bitmap representing candidate numbers, i.e. those whose primeness is to be determined. Suppose one wants to check 100 candidates for primeness, e.g. 2 to 101. This requires a bitmap of 100 bits. Given the primes that we already know, we flag all multiples of those primes within the array e.g. 2, 4, 6, 8... and 3, 6, 9, 12.... with 1 bits, indicating that these numbers are non-prime.

What's left is a set of unflagged numbers that we need to test for primeness by calling a generic is_prime() function. If we learn of a new prime number, we must flag the numbers in the array that ar multiples of it, e.g. 2n, 3n, 4n... to remove those from consideration. We leave the actual prime number (1n) unflagged.

So we proceed through the array, doing a primeness test on each non-flagged number. If it's prime or non-prime, we update the array. When we reach the end of the array, we know that the remaining non-flagged numbers are in fact primes.

The Sieve of Eratosthenes simply lets you avoid factoring the multiples of known primes, which is a large percentage of the numbers in the array, thereby saving you time.

One disadvantage of Eratosthenes is that you have to allocate a fixed-length array at the start. Even though we are using a bit array, if you want to determine all of the primes among the first 4 billion numbers, you'd need an array that is 0.5 billion bytes and this becomes a resource-hungry endeavor.

Eratosthenes is ultimately limited by the speed of your main memory, because the array is typically far too large to fit in your CPU's caches. If you plan to allocate a bitmap that is larger than your available RAM, you can expect a lot of paging to disk, which will impact greatly on the speed of the program.

Sieve of Smith (my sieve)

My sieve takes some inspiration from the above sieve. Instead of eliminating every single multiple of every prime using a bitmap, I keep counters for the lower primes in the CPU registers. As one proceeds sequentially into higher numbers, I use the counters to skip over multiples of primes, rather than a bitmap.

I originally wrote this in 32-bit x86 assembly language, which provided enough 8-bit registers to skip over most of the multiples of lower primes. When I rewrote this in x86_64 assembly language, I had more registers to use and so the coverage is more complete. To repeat the basic theme: By eliminating multiples of the 12 lowest primes, the vast majority of numbers are removed from consideration and do not need to be checked for primeness.

The most time-consuming part of generating prime numbers is always the checking of a number for primeness by dividing known primes into it, because divisions are very compute-intensive. One's goal should always be to avoid doing divisions. By using counters, I can avoid doing most of those divisions.

I am also avoiding the memory-intensive approach of Eratosthenes as well.

The pertinent data for avoiding perhaps 95% of the divisions is held in the fastest memory available: The processor's registers.

The remaining necessary work consists mainly of divisions, done to determine the primeness of a candidate number, which requires dividing each unsigned integer in the array of prime numbers into the candidate number, until a zero remainder is found or not. As program execution continues, that array is always increasing in size because it is, after all, the output array.

Sieve of Smith also uses the common technique in my is_prime(n) routine to check n against primes up to and including ciel(sqrt(n)). Square roots are time-consuming but usually involve only a handful of divisions.

Here is my pseudocode for the Sieve of Smith, checking against the lower 9 primes:

function generate_primes
{
  array primes[] = { 3,5,7,11,13,17,19,23 } 
  array prime_counters[8] = { 0,0,0,0, 0,0,0,0 } 
  let n_primes := 8
  let n := 0
  while (always) {
    ++n
    for i = 0 to 7 do prime_counters[i]++
    let skip := false
    if prime_counters[0] >= 3 {
      prime_counters[0] = 0
      skip = true
    }
    if prime_counters[1] >= 5 {
      prime_counters[1] = 0
      skip = true
    }
    if prime_counters[2] >= 7 {
      prime_counters[2] = 0
      skip = true
    }
    if prime_counters[3] >= 11 {
      prime_counters[3] = 0
      skip = true
    }
    if prime_counters[4] >= 13 {
      prime_counters[4] = 0
      skip = true
    }
    if prime_counters[5] >= 17 {
      prime_counters[5] = 0
      skip = true
    }
    if prime_counters[6] >= 19 {
      prime_counters[6] = 0
      skip = true
    }
    if prime_counters[7] >= 23 {
      prime_counters[7] = 0
      skip = true
    }
    if (skip) continue
    if (n is even) then continue
    if (is_prime (n, primes, n_primes)) {
      primes[n_primes] = n
      multiples[n_primes] = n
      n_primes++
      println ("%Lu is prime", n)
    }
  }
}

Analysis

My sieve avoids testing most numbers for primeness, because it exploits the fact that multiples of the lower primes emcompass most candidate numbers, and certainly more than do multiples of higher primes.

Performance

These numbers assume using my enhanced is_prime function.

On my Core i5 2.4 GHz with DDR3 main memory, running a 32-bit variant of my sieve, 203 million primes were generated in 85 minutes using 32-bit divisions.

On my Core i5 2.4 GHz with DDR3 main memory, 100 million primes were generated in 3284 seconds, or about 55 minutes, using 64-bit divisions.

On an old Celeron 2.8 GHz with slow DDR main memory, 100 million primes required 4792 seconds, or about 80 minutes. This was running my 32-bit C code variant, compiled with gcc -O6, on GNU/Linux, kernel 2.6.

Areas for improvement

It is conceivable that division operations in the is_prime() routine could be made faster by using vector processing i.e. XMM or YMM registers. For instance, while multiples of the lower primes can be easily avoided with operations using the main register set, the next-higher known primes could be divided into the candidate number efficiently using the entire set of XMM registers very quickly. However the precision of this would be limited by the number of significant bits available per number in those registers.

Download

FAQs

How many primes are there from 0 to (2^31)-1?
203280240