Monday, March 21, 2016

SPO600 Compiler Optimizations

As we are starting projects shortly, this week there is no lab post. Instead we take a look at some gcc optimization flags. Everyone in the course was tasked with selecting two gcc compiler optimization flags and figuring out what they do.

My first choice is:
-floop-strip-mine

DISCLAIMER:
We cannot actually run the -floop-strip-mine flag right now, because it requires a specific library (the Graphite loop-optimization framework) to be built into gcc at build time. This library is not present in most readily available gcc builds (including the one you probably have installed and the one I have installed). I have tried building gcc myself, but the process is fairly complicated and I could not get it to work. This means that in this post I will be showing a manual example of what some sample code would look like before and after the transformation, but there will be no assembly demonstrating the power of the actual flag.

The f at the start of these options stands for "flag". You can turn a flag on by adding it as an argument to a compilation, for example:

gcc -floop-strip-mine program.c

Some optimizations are always on, and some options, such as -O2 and -O3, turn on entire groups of optimization flags. To turn off a specific flag, use the -fno- prefix, for example:

gcc -fno-loop-strip-mine program.c
The reason I chose -floop-strip-mine is that our projects could potentially benefit a lot from multi-threading if they do not take advantage of it yet, and -floop-strip-mine is an optimization that is especially beneficial in multi-threaded situations.

-floop-strip-mine is a flag that turns on the strip-mining of loops (also known as loop blocking or tiling). To see how that works, here is a pseudocode example:

Let's say we have two two-dimensional arrays, a and b, that both have 8 rows and 8 columns.
Let us assume they are filled with values; what those values are doesn't really matter right now. We are going to loop through the arrays and do a multiplication, storing the result in a two-dimensional array c.
int a[8][8], b[8][8], c[8][8];

for(int i=0; i<8; i++){
    for(int j=0;j<8;j++){
        c[i][j] = a[i][j] * b[j][i];
    }
}

The values in a and b are stored in main memory; let's see what happens when we pull them out of main memory into the cache. Let's assume our cache lines can store 4 integer values at a time.

The first iteration:
i = 0;
j = 0;
elements we need: a[0][0], b[0][0].
lines going from main memory into cache:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

The second iteration:
i = 0;
j = 1;
elements we need: a[0][1], b[1][0]
Hey that is nice! a[0][1] is already in cache so we can reuse that cache line!
lines going from main memory into cache:
line 3: b[1][0], b[1][1], b[1][2], b[1][3]
Sadly we had to pull in a new line for b, as the value was not in cache yet.

The third iteration:
i = 0;
j = 2;
elements we need: a[0][2], b[2][0]
As in the second iteration the a-line is already in cache, so we can reuse it, the b line however we have to pull in:
line 4: b[2][0], b[2][1], b[2][2], b[2][3]

The fourth iteration:
i=0;
j=3;
elements we need: a[0][3], b[3][0]
As in previous iterations, a is in cache and we pull in:
line 5: b[3][0], b[3][1], b[3][2], b[3][3]

The fifth iteration:
i=0;
j=4;
elements we need: a[0][4], b[4][0]
Now we need to also bring in a new line for a, note that we have been using the same line in all the previous iterations though:
line 6: a[0][4], a[0][5], a[0][6], a[0][7]
line 7: b[4][0], b[4][1], b[4][2], b[4][3]

We can sort of see a pattern here: a is nicely lined up in cache; if we proceed after the 4th iteration it will still only take up 2 lines of cache space. The b-array, however, is accessed column-wise: every iteration we need to bring a new cache line in from main memory! Note that this is still not a problem in a realistic situation where we have more than enough cache lines to store all the values in b.

To illustrate what strip mining is and why it works, let us assume we only have 8 cache lines to store the values of a and b in. Whenever a new cache line is brought in that does not fit in the cache, the oldest unused cache line gets thrown out (a least-recently-used, or LRU, policy).

In the next few iterations I am hoping to make clear what goes wrong here:
The sixth iteration:
i=0;
j=5;
elements we need a[0][5], b[5][0]
cache lines to pull in:
line 8: b[5][0], b[5][1], b[5][2], b[5][3]
Note that our cache is now full! We will start replacing the oldest unused line in the next iterations!

The seventh iteration:
i=0;
j=6;
elements we need a[0][6], b[6][0]
We find the oldest unused cache line, which is line 2 (we kept using line 1 up to the fourth iteration). Cache lines to pull in:
replace line 2: b[6][0], b[6][1], b[6][2], b[6][3]

The eighth iteration:
i=0;
j=7;
elements we need: a[0][7], b[7][0]
We find the oldest unused cache line, which is line 3 and replace it:
replace line 3: b[7][0], b[7][1], b[7][2], b[7][3]

This was the final iteration where i remained constant. So far we have not encountered any real problems besides having to replace some cache lines, which is not a big deal yet. Now we look at the first few iterations where i=1.

The ninth iteration:
i=1;
j=0;
elements we need: a[1][0], b[0][1]
First we find the oldest unused cache line, which is line 4, and replace it; then we find the next oldest unused cache line, which is line 1, and replace that too:
replace line 4: a[1][0], a[1][1], a[1][2], a[1][3]
replace line 1: b[0][0], b[0][1], b[0][2], b[0][3]
Note that we used to have the b-value b[0][1] in cache already (in line 2), but we were forced to replace that line because our cache was full! This means we have to pull the same line from main memory into cache twice!

The tenth iteration:
i=1;
j=1;
elements we need: a[1][1], b[1][1]
Again we see the recurring pattern: we already have a[1][1] in the cache in line 4, but we have to replace a cache line to get the b-value in. We already had b[1][1] in cache (line 3, pulled in during the second iteration), but we were forced to replace that line in the eighth iteration because our cache was full!

The pattern we are seeing here is that because the b-array is accessed as b[j][i] instead of b[i][j], we run into trouble fitting the values into cache properly and create cache misses by replacing lines whose values we still need later on!

Normally, in a situation like this where the reversed access b[j][i] is causing the cache misses, you would apply loop interchange: reverse the order of the loops so that b[j][i] is effectively walked row by row. However, we cannot do that in this case, because a[i][j] is also in this loop! Reversing the loops would create the exact same problem, only on the other side: after swapping the loops around we would have a lot of cache misses on the a-array!

This is where strip mining comes in!
We take the code we had written for our original program and turn it into this:
int a[8][8], b[8][8], c[8][8];

for(int x=0; x<8; x+=4){
    for(int y=0; y<8; y+=4){
        for(int i=x; i<min(x+4, 8); i++){
            for(int j=y; j<min(y+4, 8); j++){
                c[i][j] = a[i][j]*b[j][i];
            }
        }
    }
}
Now I admit that if you look at this for the first time, it does not make a lot of sense programmatically. You are introducing two extra for loops, and the loop bounds look strange! From a cache viewpoint, however, this makes a lot of sense.

Here is what the iterations in this loop look like:
1st:
x=0, y=0, i=0; j=0
cache lines:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

2nd:
x=0, y=0, i=0; j=1
cache lines:
line 3: b[1][0],b[1][1],b[1][2],b[1][3]

3rd:
x=0, y=0, i=0; j=2
cache lines:
line 4: b[2][0],b[2][1],b[2][2],b[2][3]

4th:
x=0, y=0, i=0; j=3
cache lines:
line 5: b[3][0],b[3][1],b[3][2],b[3][3]

5th:
j is no longer smaller than 4 (y+4), so instead we go up one loop and increase i!
x=0, y=0, i=1, j=0
cache lines:
line 6: a[1][0], a[1][1], a[1][2], a[1][3]
Note that here we still have the required b cache line!

6th:
x=0, y=0, i=1,j=1
cache lines:
No cache lines required! We have line 6 which contains a[1][1] and line 3 which contains b[1][1]!

7th:
x=0, y=0, i=1,j=2
cache lines:
No cache lines required! We have line 6 which contains a[1][2] and line 4 which contains b[2][1]!

Actually if you keep on iterating the loop, you can see that for the first 16 iterations, all we need is 8 cache lines! The lines we are missing now are a[2][0], a[2][1], a[2][2], a[2][3] and a[3][0], a[3][1], a[3][2], a[3][3], but that is it for the first 16 iterations! In fact, all groups of 16 iterations in the cycle only require 8 cache lines to store that particular block!

If we look at the arrays visually, we are actually structuring the loops so they process these 4-by-4 blocks one at a time, before moving on to the next block! The reason for this is that these blocks happen to fit perfectly in our limited cache size.



In the real world obviously we have larger cache lines and more of them, but this concept applies in exactly the same way to arrays that are much larger than this one. This means that if you are working with very large two-dimensional arrays and traversing those arrays in a manner that is sub-optimal, strip-mining may be the solution for you!

Another time when strip-mining can be very valuable is if you are dealing with a multi-threading situation. Often when multi-threading, you will divide the data up into chunks that will be given to a specific processor. All the processors then run the same program, but on different pieces of data. This is a common way of managing workloads between multiple processors.

A problem that arises from this is cache coherency between processors. If one processor holds a cache line containing a value that is also in another processor's cache, and it updates a value on that line, the whole line has to be updated in the second processor's cache, even though the second processor may not even need the specific value that was updated.

So if processor 1 was working on a[0][1] and a[0][2], and processor 2 was working on a[0][3] and a[0][4], the cache line that processor 1 is working on accidentally also contains a[0][3], because we can only pull in entire cache lines at a time. This means that if processor 2 updates a[0][3], processor 1 will require an update, even though it is not even actively working on a[0][3]! It just happened to be on the same cache line. There is a lot of unnecessary communication between processors this way.

To prevent this unwarranted communication, we strip-mine the loop and instead deal out the 4-by-4 blocks of 16 values to different processors. This way there is no overlap in cache lines, and each processor stays out of the other processors' way. If you wish to know more about this very common problem in multi-processing, search for "false sharing".

So to recap:
We strip-mine because:
a. If our cache size is small, it prevents a lot of cache-misses.
b. If we are multi-threading, it prevents a lot of communication overhead caused by false sharing.

I am a little bit sad that I could not present you with a view of what gcc does with our code in terms of assembly. I will look into obtaining a gcc build with the necessary libraries and post some more on that if I find one.
