среда, 2 мая 2018 г.

Answer to "Is software prefetching (__builtin_prefetch) useful for performance?"

Recently I've read a short note where an author tells that software prefetch is useless for performance. I think that it would be interesting to listen to another point of view.
First of all let's say a few words about software prefetch itself. Prefetch is an instruction in modern processor that makes possible to load data from given address into cache. That is done before some read/write access to eliminate memory latency during the effective access. Prefetch may be hardware or software. In the current note we will speak about software prefetch.

The first thing that should be said about prefetch is that it mainly should be filled in by a compiler. This proposition grows up from the principal of its work. The success of prefetch optimization depends on the guess about the time when the data is needed by a program. If we insert prefetch too early, the data by the address will not be in the cache. If we insert prefetch too late the given cache line will be useless for us. The exact time of the prefetch instruction strongly depends on the target hardware memory system and it is not very portable.

Even for a compiler it may be difficult to find a good place for a prefetch. So there is a builtin to make available to insert prefetch by hand.

Now let us see a loop that we are going to make faster with prefetch (code on github):

    for(int i = 0; i < ARR_SIZE; i+= unroll)
        res += data[i] * A * B + C - D * E;
        res += data[i + 1] * A * B + C - D * E;

We can see that this is simple loop counting some kind of sum of all elements of an array. It is unrolled with factor 2 by hand to make good effect for prefetch. In normal case this is the work for a compiler to find optimal combination of optimizations with a prefetch instruction. Our prefetch version is:

    int interval = 32;

    for(int i = 0; i < ARR_SIZE; i+= unroll)
        __builtin_prefetch(&data[i + interval], 0, 0);
        res += data[i] * A * B + C - D * E;
        res += data[i + 1] * A * B + C - D * E;

In this version we see a regular prefetch of an element that will be used only after 32 iterations of a loop. This is heuristic value and of course you should not do this in real application. But my aim was to make a simple example were prefetch speed ups a program.

So let us see the geometrical mean of five launches of two versions of loops. The launches are made by the line: gcc regular_load.c && taskset -c 1 ./a.out && gcc -DPREF regular_load.c && taskset -c 1 ./a.out. The mean results:


Let us see what happens with cache:

$ gcc regular_load.c && perf stat -B -e cache-misses,cache-references ./a.out
Simple start
Simple seconds: 3.935150

 Performance counter stats for './a.out':

       111 588 965      cache-misses:u            #   71,735 % of all cache refs   
       155 557 440      cache-references:u                                         

       6,468928214 seconds time elapsed

$ gcc -DPREF regular_load.c && perf stat -B -e cache-misses,cache-references ./a.out
Prefetch start
Prefetch seconds: 3.962791

 Performance counter stats for './a.out':

        71 325 602      cache-misses:u            #   29,053 % of all cache refs   
       245 499 956      cache-references:u                                         

       6,456740799 seconds time elapsed

We can see that prefetch allowed us to decrease cache misses from 71.735% to 29.053%. The performance improvement is 1.36%. We can say this is not a very good result and we will be right. The main problem is that prefetch itself takes time for execution. A good solution for that case will be the asynchronous array access unit as it was implemented in the Elbrus processors.

The other case when prefetch may be useful is loads by non-regular addresses for example in recursive structures as lists or trees. May be I will write about it later.

Комментариев нет:

Отправить комментарий