Tuesday, March 29, 2016

Compiler optimizations part 2

Last week we each chose two compiler optimizations. I explained my first in my previous blog; my second compiler optimization flag is -free. On some 64-bit architectures, when you load a word into the lower half of a register, the upper half of that register is implicitly zeroed out. This means that if you need to lengthen an unsigned integer from 32 to 64 bits, you can do so just by loading the word.

From the gcc manual we find that -free is a flag that tries to remove redundant extension instructions. This is also one of the optimizations that works particularly well on AArch64, which is why I wanted to take a closer look at it. It is enabled at -O2, -O3 and -Os.
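If you want to experiment with it yourself, one way is to enable the flag on top of a lower optimization level and inspect the generated assembly; for example (the file name is just a placeholder):
gcc -O1 -free -S program.c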

Let me start by saying that even though I've tried multiple programs lengthening int16s and int32s to int64s, by either multiplication or addition, I have yet to find a case where -free kicks in; so far there has been no difference in the assembly. I even tried converting 32-bit floats to 64-bit integers to see if that would trigger a lengthening instruction, but it never did. gcc did do something interesting instead, though, which I would like to highlight for a second. This is the program I used to attempt to generate a lengthening instruction:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <inttypes.h>

uint64_t sumArray(uint32_t *arr, uint32_t *arr2, int length){
    uint64_t result = 0;
    for(int i=0; i<length; i++){
        //the 32-bit sum is widened to 64 bits when added to result
        result += arr[i] + arr2[i];
    }
    return result;
}

int main(){
    srand(time(NULL));
    const int length=256;
    uint32_t arr[length];
    uint32_t arr2[length];
    for(int i=0; i<length; i++){
        arr[i] = rand();
        arr2[i] = rand();
    }
    //%d would be wrong for a 64-bit unsigned value; PRIu64 is the portable format
    printf("%" PRIu64 "\n", sumArray(arr, arr2, length));
    return 0;
}
The funny thing is that this program did not actually create a lengthening instruction, but instead contained the following instruction:
mov %eax, %eax
On the surface this looks very strange, because we are moving a 32-bit word into the same register. Look slightly deeper, though: the mov instruction zeroes out the top half of the register as well, so it implicitly lengthens the 32-bit unsigned integer to a 64-bit unsigned integer.
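You can see the same behaviour with a much smaller function; a minimal sketch (the function is my own, assuming gcc -O2 on x86_64):

#include <stdint.h>

uint64_t widen(uint32_t x) {
    //no explicit extension instruction is needed: writing the lower
    //32-bit half of a register implicitly zeroes the upper half, so
    //this typically compiles to just: mov %edi, %eax / ret
    return x;
}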
I've tried several versions of this program that multiply or add 32-bit and 64-bit unsigned integers in various combinations, but I have yet to generate a single lengthening instruction. This means that -free sadly does not do anything for the programs I have written.

I can tell you that if a mov instruction were followed by an explicit lengthening instruction (for an unsigned 32-bit int), that would be exactly the kind of case where -free kicks in and removes the lengthening instruction; sadly, I was not able to generate such a case.
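Purely as a hypothetical illustration (I never managed to get gcc to emit this), such a redundant pair on AArch64 could look something like:
ldrsw x0, [x1]    // loads a 32-bit value and sign-extends it to 64 bits
sxtw x0, w0       // re-extends a value that is already extended; this is the kind of redundancy -free removes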

Project selection part 2

Alright, after the disappointing failure I experienced with slimdata I am ready for a new try! I tried to install several of the leftover projects (most of them had been selected already). Some had fairly complicated build instructions to get the latest version working, and for some of the packages listed I was not sure whether they were still around at all. Notably, the latest version of coreutils has a bunch of prerequisites; I followed their README to install the most recent version, but the prerequisites would not install correctly even though I followed the instructions to the letter. sha2 and sha3 were also somewhat ambiguous: it was hard to find the actual source code those listings referred to. After exploring these packages I settled on ncompress.

ncompress is a compression program that uses the Lempel-Ziv-Welch (LZW) approach, a well-known compression algorithm. The basic idea is that it builds a lookup table of common sequences in the file, so that fewer bits are needed to represent each repeated sequence. This means there has to be some sort of repetition in the file for there to be any gain, so we cannot use a randomly generated file. Luckily I wrote a piece of code in my Project selection 1 blog that writes a repeating int pattern either directly to a file or as text to a file. We can use the output from that program as input for ncompress.
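To give an idea of the approach, here is a minimal LZW encoder sketch I wrote for illustration; it is not ncompress's actual implementation, just the textbook algorithm with a slow linear-search dictionary:

#include <stdio.h>
#include <string.h>

#define MAX_DICT 4096
#define MAX_SEQ 64

static char dict[MAX_DICT][MAX_SEQ];
static int dict_size = 0;

//linear search; real implementations use a hash table or trie
static int find_code(const char *seq) {
    for (int i = 0; i < dict_size; i++)
        if (strcmp(dict[i], seq) == 0)
            return i;
    return -1;
}

int main(void) {
    //seed the dictionary with every single byte value (codes 0-255)
    for (int c = 0; c < 256; c++) {
        dict[dict_size][0] = (char) c;
        dict[dict_size][1] = '\0';
        dict_size++;
    }
    const char *input = "ABABABABABAB"; //repetitive input compresses well
    char w[MAX_SEQ] = "";
    for (const char *p = input; *p; p++) {
        char wc[MAX_SEQ];
        snprintf(wc, sizeof wc, "%s%c", w, *p);
        if (find_code(wc) >= 0) {
            //the extended sequence is already known, keep growing it
            strcpy(w, wc);
        } else {
            //emit the code for the known part, learn the new sequence
            printf("%d ", find_code(w));
            if (dict_size < MAX_DICT)
                strcpy(dict[dict_size++], wc);
            w[0] = *p;
            w[1] = '\0';
        }
    }
    if (w[0])
        printf("%d", find_code(w));
    printf("\n");
    return 0;
}

Running this on "ABABABABABAB" prints 6 codes for 12 input characters; a real implementation packs the codes into bits instead of printing them, which is where the size reduction comes from.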

The ncompress package is fairly old, with its last release in 2006, so I think there is likely room for optimization through a multi-processing or SIMD approach. That is where I will focus my attention, also because these are the types of optimizations that interest me most.

ncompress is easy to install by just running the git command:
git clone https://github.com/vapier/ncompress.git
This clones the repo into a directory called ncompress; you can then navigate into it and build and install the application with make install.
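After that, compressing and decompressing works like the classic Unix compress; for example (the file name is a placeholder):
compress -v data
uncompress data.Z
The -v option makes compress report the compression percentage, which is handy for the benchmarks below.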

I faced some difficulties copying my files to our AArch64 test servers aarchie and betty; for some reason my scp and sftp transfers kept getting closed by the remote host after a certain percentage. After more than 20 attempts, never getting past 20% of a 1 GB file, I gave up and looked for an alternative. It turns out rsync is a very flexible command that lets you sync directories across servers. For aarchie and betty you do have to specify the ssh port through the -e option, but some googling of the rsync command gets you there very quickly.
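For reference, the invocation looks something like this (the port, user and host are placeholders; check the server documentation for the real values):
rsync -avz -e "ssh -p <port>" ./data <user>@<host>:~/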

I then set about obtaining benchmarks for the package, which you can view below:

Some notes about the benchmarks:
- They were taken when cache was "hot" (I disregarded first run).
- They are averages of 3 runs.
- All runs used copies of the same files.
- I used the time command to measure the time spent for the entire command; see the example after these notes.
- There was a minimal amount of interference from other processes running at the same time, so keep that in mind. I did check that no large CPU-consuming applications were running.
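A typical timed run looked something like this (file names are placeholders):
time compress textfile
time uncompress textfile.Z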

aarch64 test server Betty:
compress:
2545934560 bytes (~2.4GB) textfile:
real: 83.652s
user: 82.590s
sys: 1.060s
resulting size: 49091023 (~1.93%)

1073741824 bytes (1GB) integerstream:
real: 15.811s
user: 15.323s
sys: 0.480s
resulting size: 21505241 (~2.00%)

uncompress:
2545934560 bytes (~2.4GB) textfile:
real: 14.104s
user: 12.530s
sys: 1.570s

1073741824 bytes (1GB) integerstream:
real: 5.920s
user: 5.403s
sys: 0.510s

My own x86_64 Fedora 23 installation:
compress:
2545934560 bytes (~2.4GB) textfile:
real: 39.341s
user: 28.147s
sys: 3.484s
resulting size: 49091023 (~1.93%)

1073741824 bytes (1GB) integerstream:
real: 15.626s
user: 7.479s
sys: 2.328s
resulting size: 21505241 (~2.00%)

uncompress:
2545934560 bytes (~2.4GB) textfile:
real: 38.687s
user: 6.326s
sys: 2.160s

1073741824 bytes (1GB) integerstream:
real: 14.575s
user: 2.500s
sys: 0.856s

Something interesting to look into immediately: on Betty the uncompress algorithm ran much faster relative to compression than on my own Fedora installation, where uncompress was pretty much the same speed as compression. I would like to do some more benchmarking later on real written text and maybe some installers, but as a basic benchmark I think this does alright. I was somewhat pressed for time because I had to select a new project after spending too long trying to benchmark the previous package.

Project selection part 1

We've started our projects and I have selected slim, a compression tool that mainly deals with very large sets of integers with a 1-, 2- or 4-byte word size. One of the things I noticed immediately was that it had not been updated in a while: it was written in 2008 and some additions were made in 2013. It is effectively a one-man project, as all contributions were made by a single person.

I knew it was not going to be easy to find documentation, because there was only one developer and he last worked on the project three years ago. Undeterred, I installed the package on both aarchie and my own Fedora installation. The package can be found here: https://sourceforge.net/projects/slimdata/
The installation was very straightforward: a tar -xvf, a ./configure with some parameters found in the included README, and a make install. I could now slim and unslim files; I tried it on a few text files and it worked. However, the program clearly stated it was made for large integer collections with repetitive patterns, so I set out to create a program that would generate such a file for me. After much experimentation I finished with this:
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

int main() {
    srand(time(NULL));
    //number of 4-byte ints in a GB
    int64_t size = 1073741824/4;
    int repetition = 20000;
    int32_t* data = (int32_t*) malloc(sizeof(int32_t)*size);
    if (data == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    int64_t i = 0;
    while (i < size) {
        int32_t number = rand();
        //repeat the same value j times to create a repetitive pattern
        for (int j = 0; j < repetition; j++) {
            if (i < size) {
                data[i] = number;
                i++;
            }
        }
    }
    FILE *fp = fopen("data", "wb");
    if (fp == NULL) {
        fprintf(stderr, "could not open output file\n");
        free(data);
        return 1;
    }
    //write the raw binary values; the commented-out loop below writes
    //the same numbers as text instead
    fwrite(data, sizeof(int32_t), size, fp);
    //for(int64_t x = 0; x < size; x++){
    //    fprintf(fp, "%d", data[x]);
    //}
    fclose(fp);
    free(data);
    printf("success\n");
    return 0;
}
I tried using fwrite to write the binary values of the integers to a file, and fprintf to write their text values. I tried random repetition amounts and swapping some integers around to create a more or less diverse file. I tried smaller and larger files. The text files would not even slim; slimming literally broke the original file into an unrecoverable state. The integer files would slim, but the maximum compression gain was 1.5% of the file, or about 20MB on a 1GB file.
I think I can safely assume that this is not the intended gain on a file of this or any size, as gzip would shave off about 50% of the file size in about the same amount of time.

It saddens me to say that I think I will have to admit defeat here and pick a different project. I will post an update on that soon.

Monday, March 21, 2016

SPO600 Compiler Optimizations

As we are starting projects shortly, there is no lab post this week. Instead we take a look at some gcc optimization flags. Everyone in the course was tasked with selecting two gcc compiler optimization flags and figuring out what they do.

My first choice is:
-floop-strip-mine

DISCLAIMER:
We cannot run the -floop-strip-mine flag right now, because it requires a specific library to be built into gcc at build time (the flag is part of gcc's Graphite loop optimization framework, which needs the ISL library). This library is not present in most readily available gcc builds, including the one you probably have installed and the one I have installed. I have tried building gcc myself, but the process is fairly complicated and I could not get it to work. This means that in this blog I will show a manual example of what some sample code looks like before and after the transformation, but there will be no assembly demonstrating the power of the actual flag.

The f at the start of the option stands for flag. You can turn a flag on by adding it as an argument to a compilation, for example:
gcc -floop-strip-mine program.c
Some optimizations are always on; other options, such as -O2 and -O3, turn on entire groups of optimization flags. To turn off a specific flag, use the -fno- prefix, for example:

gcc -fno-loop-strip-mine program.c
The reason I chose -floop-strip-mine is that our projects could potentially benefit a lot from multi-threading if they do not take advantage of it yet, and -floop-strip-mine is an optimization that is especially beneficial in a multi-threaded situation.

-floop-strip-mine is a flag that turns on the strip-mining of loops. To see how that works, here is a pseudocode example:

Let's say we have two two-dimensional arrays, a and b, that both have 8 rows and 8 columns.
Let us assume they are filled with values; what those values are doesn't really matter right now. We are going to loop through the arrays and do a multiplication, storing the result in a third two-dimensional array c.
int a[8][8], b[8][8], c[8][8];

for(int i=0; i<8; i++){
    for(int j=0; j<8; j++){
        c[i][j] = a[i][j] * b[j][i];
    }
}

The values in a and b are stored in main memory; let's see what happens as we pull them out of main memory into the cache. Let's assume our cache lines can store 4 integer values at a time.

The first iteration:
i = 0;
j = 0;
elements we need: a[0][0], b[0][0].
lines going from main memory into cache:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

The second iteration:
i = 0;
j = 1;
elements we need: a[0][1], b[1][0]
Hey that is nice! a[0][1] is already in cache so we can reuse that cache line!
lines going from main memory into cache:
line 3: b[1][0], b[1][1], b[1][2], b[1][3]
Sadly we had to pull in a new line for b, as the value was not in cache yet.

The third iteration:
i = 0;
j = 2;
elements we need: a[0][2], b[2][0]
As in the second iteration, the a-line is already in cache, so we can reuse it; the b-line, however, we have to pull in:
line 4: b[2][0], b[2][1], b[2][2], b[2][3]

The fourth iteration:
i=0;
j=3;
elements we need: a[0][3], b[3][0]
As in previous iterations, a is in cache and we pull in:
line 5: b[3][0], b[3][1], b[3][2], b[3][3]

The fifth iteration:
i=0;
j=4;
elements we need: a[0][4], b[4][0]
Now we also need to bring in a new line for a; note that we had been using the same a-line in all the previous iterations:
line 6: a[0][4], a[0][5], a[0][6], a[0][7]
line 7: b[4][0], b[4][1], b[4][2], b[4][3]

We can sort of see a pattern here: a is nicely lined up in cache; if we proceed past the 4th iteration it will still only take up 2 lines of cache space. The b-array, however, is misaligned: every iteration we need to bring a new cache line in from main memory! Note that this is still not a problem in a realistic situation where we have more than enough cache lines to store all the values of b.
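The underlying reason is that C stores two-dimensional arrays row-major; a quick sketch using the b from our example:

int *flat = &b[0][0];
//b[j][i] sits at flat[j*8 + i]: stepping j by one jumps 8 ints
//(32 bytes), which under our 4-int cache lines is a new line every time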

To illustrate what strip mining is and why it works, let us assume we only have 8 cache lines to store the values of a and b. Whenever a new cache line is brought in while the cache is full, the oldest unused cache line gets thrown out.

In the next few iterations I am hoping to make clear what goes wrong here:
The sixth iteration:
i=0;
j=5;
elements we need: a[0][5], b[5][0]
cache lines to pull in:
line 8: b[5][0], b[5][1], b[5][2], b[5][3]
Note that our cache is now full! We will start replacing the oldest unused line in the next iterations!

The seventh iteration:
i=0;
j=6;
elements we need: a[0][6], b[6][0]
We find the oldest unused cache line, which is line 2 (line 1 was still in use up to the fourth iteration). Cache line to pull in:
replace line 2: b[6][0], b[6][1], b[6][2], b[6][3]

The eighth iteration:
i=0;
j=7;
elements we need: a[0][7], b[7][0]
We find the oldest unused cache line, which is line 3 and replace it:
replace line 3: b[7][0], b[7][1], b[7][2], b[7][3]

This was the final iteration where i=0. So far we have not encountered any real problems, besides having to replace some cache lines, which is not a big deal yet. Now we look at the first few iterations where i=1.

The ninth iteration:
i=1;
j=0;
elements we need: a[1][0], b[0][1]
First we find the oldest unused cache line, which is line 4, and replace it; then we find the next oldest unused cache line, which is line 1, and replace that too:
replace line 4: a[1][0], a[1][1], a[1][2], a[1][3]
replace line 1: b[0][0], b[0][1], b[0][2], b[0][3]
Note that we used to have the b-value b[0][1] in cache already (back in line 2), but that line was replaced because our cache was full! This means we have to pull the same value from memory into cache twice!

The tenth iteration:
i=1;
j=1;
elements we need: a[1][1], b[1][1]
Again we see the recurring pattern: we already have a[1][1] in the cache (line 4), but we have to replace a cache line to get the b-value in. We had b[1][1] in cache back in the second iteration, but we were forced to evict that line because our cache was full!

The pattern we are seeing is that because the b-array is accessed as b[j][i] instead of b[i][j], we are running into trouble fitting the values into cache, creating cache misses by evicting lines whose values we still need later on!

Normally, in a situation like this where the reversed access b[j][i] causes the cache misses, you would apply loop interchange: reverse the order of the loops so b[j][i] is effectively traversed like b[i][j]. However, we cannot do that in this case, because a[i][j] is in the same loop! Reversing the loops would create the exact same problem on the other side: after swapping the loops around we would get a lot of cache misses on the a-array!
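For completeness, the interchanged version of our loop would look like this; notice how it just moves the problem from b to a:

for(int j=0; j<8; j++){
    for(int i=0; i<8; i++){
        //b[j][i] is now walked row by row, but a[i][j] column by column
        c[i][j] = a[i][j] * b[j][i];
    }
}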

This is where strip mining comes in!
We take the code we had written for our original program and turn it into this:
int a[8][8], b[8][8], c[8][8];

for(int x=0; x<8; x+=4){
    for(int y=0; y<8; y+=4){
        for(int i=x; i<min(x+4, 8); i++){
            for(int j=y; j<min(y+4, 8); j++){
                c[i][j] = a[i][j] * b[j][i];
            }
        }
    }
}
Now I admit that if you look at this for the first time, it does not make a lot of sense programmatically. You are introducing two extra for loops, and the loop bounds look strange! From a cache viewpoint, however, this makes a lot of sense.

Here is what the iterations in this loop look like:
1st:
x=0, y=0, i=0; j=0
cache lines:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

2nd:
x=0, y=0, i=0; j=1
cache lines:
line 3: b[1][0],b[1][1],b[1][2],b[1][3]

3rd:
x=0, y=0, i=0; j=2
cache lines:
line 4: b[2][0],b[2][1],b[2][2],b[2][3]

4th:
x=0, y=0, i=0; j=3
cache lines:
line 5: b[3][0],b[3][1],b[3][2],b[3][3]

5th:
j is no longer smaller than 4 (y+4) so instead we go up one loop and increase i!
x=0, y=0, i=1, j=0
cache lines:
line 6: a[1][0], a[1][1], a[1][2], a[1][3]
Note that here we still have the required b cache line!

6th:
x=0, y=0, i=1,j=1
cache lines:
No cache lines required! We have line 6 which contains a[1][1] and line 3 which contains b[1][1]!

7th:
x=0, y=0, i=1,j=2
cache lines:
No cache lines required! We have line 6 which contains a[1][2] and line 4 which contains b[2][1]!

Actually, if you keep iterating the loop, you can see that for the first 16 iterations all we need is 8 cache lines! The only lines we still have to pull in are a[2][0], a[2][1], a[2][2], a[2][3] and a[3][0], a[3][1], a[3][2], a[3][3], and that is it for the first 16 iterations! In fact, every group of 16 iterations in the cycle only requires 8 cache lines to store its particular block!

If we look at the arrays visually, we are structuring the loops so that they finish these blocks of 4 by 4 first, before moving on to the next block! The reason is that these blocks happen to fit perfectly in our limited cache size.
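Sketched out, the 8 by 8 arrays get carved into four 4 by 4 blocks that are processed one after the other:

+---------+---------+
| block 1 | block 2 |
|  4 x 4  |  4 x 4  |
+---------+---------+
| block 3 | block 4 |
|  4 x 4  |  4 x 4  |
+---------+---------+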



In the real world we obviously have larger cache lines and more of them, but the concept applies in exactly the same way to arrays that are much larger than this one. This means that if you are working with very large two-dimensional arrays and traversing them in a sub-optimal order, strip-mining may be the solution for you!

Strip-mining can also be very valuable in a multi-threading situation. When multi-threading, you often divide the data into chunks that are handed to specific processors. All the processors then run the same program, but on different pieces of data. This is a common way of dividing workloads between multiple processors.

A problem that arises from this is cache coherency between processors. If one processor has a cache line containing a value that is also in another processor's cache, and it updates a value on that same line, the whole line needs to be updated in the second processor, even though the second processor may not even need the specific value that was updated.

So say processor 1 is working on a[0][1] and a[0][2], and processor 2 is working on a[0][3] and a[0][4]. The cache line processor 1 is working on accidentally also contains a[0][3] and a[0][4], because we can only pull in entire cache lines at a time. This means that if processor 2 updates a[0][3], processor 1 requires an update, even though it is not actively working on a[0][3]; it just happened to be on the same cache line. There is a lot of unnecessary communication between processors this way.

To prevent this unwarranted communication we strip-mine the loop and instead deal out the blocks of 16 values to different processors. This way there is no overlap in the cache lines and each processor stays out of the others' way. If you wish to know more about this problem, which is very common in multi-processing, search for "false sharing".
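As a rough sketch of what dealing out the blocks could look like, using the arrays from our example (this assumes OpenMP and compiling with -fopenmp; it is my own illustration, not something the flag generates for you):

#pragma omp parallel for collapse(2)
for (int x = 0; x < 8; x += 4) {
    for (int y = 0; y < 8; y += 4) {
        //each thread receives a complete 4x4 block, so no two threads
        //write to the same cache lines
        for (int i = x; i < x + 4; i++) {
            for (int j = y; j < y + 4; j++) {
                c[i][j] = a[i][j] * b[j][i];
            }
        }
    }
}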

So to recap:
We strip-mine because:
a. If our cache size is small, it prevents a lot of cache-misses.
b. If we are multi-threading, it prevents a lot of communication overhead caused by false sharing.

I am a little sad that I could not present you with a view of what gcc actually does to our code in terms of assembly. I will keep looking for a gcc build with the necessary libraries and will post more on this if I find one.