Quick update:
I got a reply to my github pull request. Sadly it was not what I had hoped for. The package manager replied that they want to give users full control of their compiler optimizations and not force anything on them.
As there is theoretically no downside to applying atleast a base level of optimization, this argument seems a little bit strange to me.
The way I would do it is give the user a base level of optimization, so that users that have no knowledge of compiler optimizations get the benefit in the significant speedup without having to do anything. Advanced users could then opt-out of the optimization if they want to. I realize that this is more work for advanced users. However I think the situation where the user opts out of optimizations would come up very rarely. Whereas the situation where the user should have applied some degree of optimization, but did not because the user didn't know he should have, is more prevalent.
donderdag 21 april 2016
Phase 3 Pushing changes upstream
To get my changes accepted upstream, the first thing I needed to figure out was where in the make / build process to insert my change. Since in the phase 2 optimization testing I only directly edited the Makefile to see the changes.
It turns out ncompress uses a Makefile.def, which contains the definitions to be used for the make call. The package is not built with ./configure like most, but instead, after cloning you can directly make install. This means I had to add the following 2 lines to the Makefile.def:
Next I had to figure out how to contribute to the package. Luckily since the package is available on github I can fork the package. So that is what I did, I forked the package, made the change and committed it to my branch. I then created the following pull request:
https://github.com/vapier/ncompress/pull/7
I've added a link to my blog and a short description in the pull request, since there is no other means of communication described on the project page, now all we can do is wait for a response!
It turns out ncompress uses a Makefile.def, which contains the definitions to be used for the make call. The package is not built with ./configure like most, but instead, after cloning you can directly make install. This means I had to add the following 2 lines to the Makefile.def:
#Compiler optimization flagsI tested whether the package still built correctly with my change to the Makefile.def and it built as it should with the -O3 optimization turned on.
CFLAGS=-O3
Next I had to figure out how to contribute to the package. Luckily since the package is available on github I can fork the package. So that is what I did, I forked the package, made the change and committed it to my branch. I then created the following pull request:
https://github.com/vapier/ncompress/pull/7
I've added a link to my blog and a short description in the pull request, since there is no other means of communication described on the project page, now all we can do is wait for a response!
zondag 10 april 2016
Phase 2 Process
So in my previous blog I found an optimization, but I was not too happy about it because it seemed like such a simple step to take. I guess that is sort of the point with compiler optimizations though. The more efficiently you can get things done, the better. The compiler assists with this very handily.
I tried to get the compiler to tell me some possible optimizations as well, using -fopt-info-vec-missed to see what vectorization possibilities it missed and why. The compiler came up with about 500 lines of missed vectorization, so I wasn't sure where to start with that. I then used -fopt-info-vec to see that some things did actually get vectorized, but that got me no further either.
I tried profiling the program using gprof, it resulted in me finding out that the program spends its entire time in a single large function, whether compressing or uncompressing. This brought me no closer to finding any other optimization.
I have fiddled with the DUSERMEM and DREGISTERS to see if it makes a difference, but I think there is a set maximum to these values somewhere, although I do not know why. To me they are sort of magic numbers right now.
All in all the search and constant waiting for benchmarks was frustrating to me, especially because I found next to nothing besides the obvious compiler flag. If I was to attempt a project like this again. I would have an extra phase and my phase 2 would be write an automatic benchmarking suite where you press a button and a reliable benchmark rolls out the other end.
I also wouldn't use my own laptop to benchmark on, because for some reason the Virtual Machine gives very varying performances and I can never tell if it is an accurate representation of change or not. Luckily betty had my sanity covered there.
I tried to get the compiler to tell me some possible optimizations as well, using -fopt-info-vec-missed to see what vectorization possibilities it missed and why. The compiler came up with about 500 lines of missed vectorization, so I wasn't sure where to start with that. I then used -fopt-info-vec to see that some things did actually get vectorized, but that got me no further either.
I tried profiling the program using gprof, it resulted in me finding out that the program spends its entire time in a single large function, whether compressing or uncompressing. This brought me no closer to finding any other optimization.
I have fiddled with the DUSERMEM and DREGISTERS to see if it makes a difference, but I think there is a set maximum to these values somewhere, although I do not know why. To me they are sort of magic numbers right now.
All in all the search and constant waiting for benchmarks was frustrating to me, especially because I found next to nothing besides the obvious compiler flag. If I was to attempt a project like this again. I would have an extra phase and my phase 2 would be write an automatic benchmarking suite where you press a button and a reliable benchmark rolls out the other end.
I also wouldn't use my own laptop to benchmark on, because for some reason the Virtual Machine gives very varying performances and I can never tell if it is an accurate representation of change or not. Luckily betty had my sanity covered there.
Project Phase 2
After last week where we ended our project selection with a somewhat hasty switch to another project I am happy to be able to just dig into this package a little bit. Sadly I have been sick half of this week(still am) so the progress is slower than I would have hoped.
In my project selection blog I mentioned that I wanted to look at vectorization options the compiler can do for us if we make some small changes to the code. However before looking into the code I went to take a look at the makefile to see if there are any vectorization flags currently turned on.
This is what the compiler options line looks like on both betty and my own x86 machine:
The most glaring lack here is that there is not even a base level of optimization applied, so I set out to test that first. To verify that the code has not been changed in a way that it now has a different meaning, we will be using md5 checksum to see if our compressed and uncompressed files remain the same. After slowly ramping up the optimization level, it turns out that the program works on O3, without it causing any problems.
During benchmarks I first encountered a couple of runs where it looked like uncompressing the files was slower than before applying the optimization, however after running some more benchmarks, it seems to me that it was probably background noise from other processes running on my system.
After the -O3 optimization level had been applied we see the following results in the benchmarks:
aarch64 testserver Betty:
compress:
2545934560(~2.4GB) textfile:
real: 64.467s
user: 63.280s
sys: 1.180s
1073741824(1GB) integerstream:
real: 9.413s
user: 8.980s
sys: 0.430s
uncompress:
2545934560(~2.4GB) textfile:
real: 10.882s
user: 9.510s
sys: 1.370s
1073741824(1GB) integerstream:
real: 4.734s
user: 3.920s
sys: 0.700s
my own x86_64 Fedora 23 installation:
compress:
2545934560(~2.4GB) textfile:
real: 34.812s
user: 18.821s
sys: 4.715s
1073741824(1GB) integerstream:
real: 13.274s
user: 3.789s
sys: 1.651s
uncompress:
2545934560(~2.4GB) textfile:
real: 45.669s
user: 5.779s
sys: 2.339s
1073741824(1GB) integerstream:
real: 17.713s
user: 2.814s
sys: 1.107s
If we compare that with last week's results we find that in all of the cases the compression was faster with the optimization turned on. However oddly enough on my local machine the uncompress was notably slower! This could also be explained by the amount of possible disturbances in the background processes of my machine since I am running a Virtual machine to test with. However I have tested extensively and have yet to see the optimized version of uncompress not do worse.
To go about implementing this optimization in the actual program I will have to add it to the Makefile recipe, for the testing I have just been editing the Makefile itself. I have also been looking for a reason as to why compression on the aarch64 server betty is so terribly slow, while uncompress is very fast, but I have yet to find the answer.
In my project selection blog I mentioned that I wanted to look at vectorization options the compiler can do for us if we make some small changes to the code. However before looking into the code I went to take a look at the makefile to see if there are any vectorization flags currently turned on.
This is what the compiler options line looks like on both betty and my own x86 machine:
options= $(CFLAGS) -DNOFUNCDEF -DUTIME_H $(LDFLAGS) $(CFLAGS) $(CPPFLAGS) -DDIRENT=1 -DUSERMEM=800000 -DREGISTERS=3What I found out is that CFLAGS, CPPFLAGS and LDFLAGS are all undefined in this case. What I also found strange is that these options define a set memory for the program to use and a number of registers the program can use.
The most glaring lack here is that there is not even a base level of optimization applied, so I set out to test that first. To verify that the code has not been changed in a way that it now has a different meaning, we will be using md5 checksum to see if our compressed and uncompressed files remain the same. After slowly ramping up the optimization level, it turns out that the program works on O3, without it causing any problems.
During benchmarks I first encountered a couple of runs where it looked like uncompressing the files was slower than before applying the optimization, however after running some more benchmarks, it seems to me that it was probably background noise from other processes running on my system.
After the -O3 optimization level had been applied we see the following results in the benchmarks:
aarch64 testserver Betty:
compress:
2545934560(~2.4GB) textfile:
real: 64.467s
user: 63.280s
sys: 1.180s
1073741824(1GB) integerstream:
real: 9.413s
user: 8.980s
sys: 0.430s
uncompress:
2545934560(~2.4GB) textfile:
real: 10.882s
user: 9.510s
sys: 1.370s
1073741824(1GB) integerstream:
real: 4.734s
user: 3.920s
sys: 0.700s
my own x86_64 Fedora 23 installation:
compress:
2545934560(~2.4GB) textfile:
real: 34.812s
user: 18.821s
sys: 4.715s
1073741824(1GB) integerstream:
real: 13.274s
user: 3.789s
sys: 1.651s
uncompress:
2545934560(~2.4GB) textfile:
real: 45.669s
user: 5.779s
sys: 2.339s
1073741824(1GB) integerstream:
real: 17.713s
user: 2.814s
sys: 1.107s
If we compare that with last week's results we find that in all of the cases the compression was faster with the optimization turned on. However oddly enough on my local machine the uncompress was notably slower! This could also be explained by the amount of possible disturbances in the background processes of my machine since I am running a Virtual machine to test with. However I have tested extensively and have yet to see the optimized version of uncompress not do worse.
To go about implementing this optimization in the actual program I will have to add it to the Makefile recipe, for the testing I have just been editing the Makefile itself. I have also been looking for a reason as to why compression on the aarch64 server betty is so terribly slow, while uncompress is very fast, but I have yet to find the answer.
dinsdag 29 maart 2016
Compiler optimizations part 2
Last week we each chose 2 compiler optimizations and in my previous blog I have explained my first, my second compiler optimization flag is -free. In some 64 bit architectures when you load a word into the lower half of a register, the upper half of that register implicitly gets zeroed out. This means that if you need to lengthen an unsigned integer from 32 to 64 bits you can do so just by loading the word.
From the gcc manual we find that -free is a flag that tries to remove redundant extension instructions. This is also one of the optimizations that specifically works well on AArch64, which is why I wanted to take a closer look at it. It is enabled at -O2, -O3 and -Os.
Let me start by saying that even though I've tried multiple programs lengthening int16's and int32's to int64's, by either multiplication or addition, I have yet to find a case where -free kicks in. So far there has been no difference in assembly. I have even tried to get 32 bit floats converted to 64 bit integers to see if it would do lengthening, but it never did. Instead gcc did do something interesting though, which I would like to highlight for a second. This was the program I used to attempt to generate a lengthening instruction.
I've tried several versions of this program, that multiply or add 32 bit to 64 bit unsigned integers or with 32 and 64 bit unsigned integers, but I have yet to generate a single lengthening instruction. This means that -free sadly does not do anything for the programs that I have written.
I can tell you that if there was a mov instruction present and afterwards a lengthening instruction(for an unsigned 32 bit int), this would be one of the cases where -free would kick in and remove the lengthening instruction, sadly I was not able to generate one of those cases.
From the gcc manual we find that -free is a flag that tries to remove redundant extension instructions. This is also one of the optimizations that specifically works well on AArch64, which is why I wanted to take a closer look at it. It is enabled at -O2, -O3 and -Os.
Let me start by saying that even though I've tried multiple programs lengthening int16's and int32's to int64's, by either multiplication or addition, I have yet to find a case where -free kicks in. So far there has been no difference in assembly. I have even tried to get 32 bit floats converted to 64 bit integers to see if it would do lengthening, but it never did. Instead gcc did do something interesting though, which I would like to highlight for a second. This was the program I used to attempt to generate a lengthening instruction.
#include <stdlib.h>The funny thing is that this program actually did not create a lengthening instruction, but instead had the following instruction:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
uint64_t sumArray(uint32_t *arr, uint32_t *arr2, int length){
uint64_t result = 0;
for(int i=0; i<length; i++){
result += arr[i] + arr2[i];
}
return result;
}
int main(){
srand(time(NULL));
const int length=256;
uint32_t arr[length];
uint32_t arr2[length];
for(int i=0; i<length; i++){
arr[i] = rand();
arr2[i] = rand();
}
printf("%d", sumArray(arr, arr2, length));
return 0;
}
mov %eax, %eaxon the surface this looks very strange, because we are moving a 32 bit word into the same register, but if you look into it slightly deeper, the mov instruction actually zeroes out the top part of the register as well, so it implicitly lengthens the 32 bit unsigned integer to a 64 bit unsigned integer.
I've tried several versions of this program, that multiply or add 32 bit to 64 bit unsigned integers or with 32 and 64 bit unsigned integers, but I have yet to generate a single lengthening instruction. This means that -free sadly does not do anything for the programs that I have written.
I can tell you that if there was a mov instruction present and afterwards a lengthening instruction(for an unsigned 32 bit int), this would be one of the cases where -free would kick in and remove the lengthening instruction, sadly I was not able to generate one of those cases.
Project selection part 2
Alright after the disappointing failure I experienced with slimdata I am ready for a new try! I tried to install several of the leftover projects (most of them have been selected already). Some of them had fairly complicated build instructions to get the last version to work and some of for some of the packages listed I was not sure if they were even around. Notably coreutils latest version has a bunch of prerequisites and I followed their readme to install the most recent version, the prerequisites will not install correctly on it even though I followed their instructions to the letter. Also sha2 and sha3 were sort of ambiguous, it was hard to find the actual source code that was meant by those listings. After exploring these packages I settled on ncompress.
ncompress is a compression algorithm that uses the Lempel-Ziv-Welch approach, this is a well known compression algorithm. The basics are that it creates a lookup-table for common sequences in the file, to reduce the number of bits required to display that sequence. This means there has to be some sort of repetition in the file in order for there to be an improvement. Therefore we can not use a randomly generated file. Luckily I wrote a piece of code in my Project selection 1 blog that writes a repeating int pattern either directly to a file or as text to a file. We can use the output from that program as input for the ncompress program.
The ncompress package is fairly old with its' last release being in 2006, as such I think there should be an optimization present in a multi-processing or simd approach. Which is what I will focus my attention on. This is also because I am most interested in these types of optimizations.
ncompress is easy to install by just running the git command:
git clone https://github.com/vapier/ncompress.git
then it will clone the repo into a map called ncompress, you can then navigate in there and make install the application.
I faced some difficulties copying my files to our aarch64 test servers aarchie and betty, for some reason my scp and sftp kept getting closed by the remote host after a certain percentage. After trying for over 20 times and reaching no more than 20% of a 1 GB file I gave up and looked for an alternative. It turns out rsync is a very flexible command that lets you sync directories across servers. For approaching aarchie and betty you do have to specify a port number with the -e option, but some googling on the rsync command gets you there very quickly.
I then set about obtaining benchmarks for the package, which are to be viewed below:
Some notes about the benchmarks:
- They were taken when cache was "hot" (I disregarded first run).
- They are averages of 3 runs.
- All runs used copies of the same files.
- I used time to measure time spent for the entire command.
-There was a minimal amount of interference caused by other processes running at the same time, so keep that in mind. I did check to see if there were no large CPU consuming applications running.
aarch64 testserver Betty:
compress:
2545934560(~2.4GB) textfile:
real: 83.652s
user: 82.590s
sys: 1.060s
resulting size: 49091023(~1,93%)
1073741824(1GB) integerstream:
real: 15.811s
user: 15.323s
sys: 0.480s
resulting size: 21505241(~2,00%)
uncompress:
2545934560(~2.4GB) textfile:
real:14.104s
user: 12.530s
sys: 1.570s
1073741824(1GB) integerstream:
real: 5.920
user: 5.403
sys: 0.510s
my own x86_64 Fedora 23 installation:
compress:
2545934560(~2.4GB) textfile:
real: 39.341s
user: 28.147s
sys: 3.484s
resulting size: 49091023(~1,93%)
1073741824(1GB) integerstream:
real: 15.626s
user: 7.479s
sys: 2.328s
resulting size: 21505241(~2,00%)
uncompress:
2545934560(~2.4GB) textfile:
real: 38.687s
user: 6.326s
sys: 2.160s
1073741824(1GB) integerstream:
real: 14.575s
user: 2.500s
sys: 0.856s
Something interesting to look into immediately is that on Betty the uncompress algorithm seemed to work much better than on my own Fedora installation, where it was pretty much the same speed as the compression algorithm. I would like to do some more benchmarking on actual written text files and maybe some installers later, but for a basic benchmark I think this does alright. I was in some stress for time because I had to select a new project after attempting to benchmark the previous package for too long.
ncompress is a compression algorithm that uses the Lempel-Ziv-Welch approach, this is a well known compression algorithm. The basics are that it creates a lookup-table for common sequences in the file, to reduce the number of bits required to display that sequence. This means there has to be some sort of repetition in the file in order for there to be an improvement. Therefore we can not use a randomly generated file. Luckily I wrote a piece of code in my Project selection 1 blog that writes a repeating int pattern either directly to a file or as text to a file. We can use the output from that program as input for the ncompress program.
The ncompress package is fairly old with its' last release being in 2006, as such I think there should be an optimization present in a multi-processing or simd approach. Which is what I will focus my attention on. This is also because I am most interested in these types of optimizations.
ncompress is easy to install by just running the git command:
git clone https://github.com/vapier/ncompress.git
then it will clone the repo into a map called ncompress, you can then navigate in there and make install the application.
I faced some difficulties copying my files to our aarch64 test servers aarchie and betty, for some reason my scp and sftp kept getting closed by the remote host after a certain percentage. After trying for over 20 times and reaching no more than 20% of a 1 GB file I gave up and looked for an alternative. It turns out rsync is a very flexible command that lets you sync directories across servers. For approaching aarchie and betty you do have to specify a port number with the -e option, but some googling on the rsync command gets you there very quickly.
I then set about obtaining benchmarks for the package, which are to be viewed below:
Some notes about the benchmarks:
- They were taken when cache was "hot" (I disregarded first run).
- They are averages of 3 runs.
- All runs used copies of the same files.
- I used time to measure time spent for the entire command.
-There was a minimal amount of interference caused by other processes running at the same time, so keep that in mind. I did check to see if there were no large CPU consuming applications running.
aarch64 testserver Betty:
compress:
2545934560(~2.4GB) textfile:
real: 83.652s
user: 82.590s
sys: 1.060s
resulting size: 49091023(~1,93%)
1073741824(1GB) integerstream:
real: 15.811s
user: 15.323s
sys: 0.480s
resulting size: 21505241(~2,00%)
uncompress:
2545934560(~2.4GB) textfile:
real:14.104s
user: 12.530s
sys: 1.570s
1073741824(1GB) integerstream:
real: 5.920
user: 5.403
sys: 0.510s
my own x86_64 Fedora 23 installation:
compress:
2545934560(~2.4GB) textfile:
real: 39.341s
user: 28.147s
sys: 3.484s
resulting size: 49091023(~1,93%)
1073741824(1GB) integerstream:
real: 15.626s
user: 7.479s
sys: 2.328s
resulting size: 21505241(~2,00%)
uncompress:
2545934560(~2.4GB) textfile:
real: 38.687s
user: 6.326s
sys: 2.160s
1073741824(1GB) integerstream:
real: 14.575s
user: 2.500s
sys: 0.856s
Something interesting to look into immediately is that on Betty the uncompress algorithm seemed to work much better than on my own Fedora installation, where it was pretty much the same speed as the compression algorithm. I would like to do some more benchmarking on actual written text files and maybe some installers later, but for a basic benchmark I think this does alright. I was in some stress for time because I had to select a new project after attempting to benchmark the previous package for too long.
Project Selection part 1
We've started our projects and I have selected slim, a compression algorithm that mainly deals with very large integers of 1,2 or 4-byte word size. One of the things I noticed immediately was that it had not been updated for a while, it was written in 2008 and some additions were made in 2013. Right now it is a one man project as all the contributions were made by a single person.
I knew it was not going to be easy to find documentation because there was only one developer and the last he worked on it was 3 years ago. Undeterred I installed the package on both aarchie and my own Fedora installation. The package can be found here: https://sourceforge.net/projects/slimdata/
The installation was very straightforward and just required a tar -xvf, a ./configure with some parameters to be found in the included README and a make install. I could now slim and unslim files, I tried it on a few text files and it worked. However the program clearly stated it was made for large integer groups with repetitive patterns, so I set out to create a program that would make a file for me. After much experimentation I finished with this:
I think I can safely assume that this is not the intended gain on a file of this or any size as gzip would win about 50% of the file size in about the same amount of time.
It saddens me to say that I think that I will have to admit defeat here and pick a different project. I will soon update on that.
I knew it was not going to be easy to find documentation because there was only one developer and the last he worked on it was 3 years ago. Undeterred I installed the package on both aarchie and my own Fedora installation. The package can be found here: https://sourceforge.net/projects/slimdata/
The installation was very straightforward and just required a tar -xvf, a ./configure with some parameters to be found in the included README and a make install. I could now slim and unslim files, I tried it on a few text files and it worked. However the program clearly stated it was made for large integer groups with repetitive patterns, so I set out to create a program that would make a file for me. After much experimentation I finished with this:
#include <stdint.h>I have tried using fwrite, to just write the binary values of the integers to a file. I have tried fprintf, to write the text values of the integers to a file. I have tried random repetition amounts and swapping around some integers to create a more or less diverse file. I have tried smaller and larger files. The text files would not even slim, it would literally break the original file into an unrecoverable state. The integer files would slim but the maximum compression file gain was 1,5% of the file or about 20MB on a 1GB file.
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
int main() {
srand(time(NULL));
//ints in a GB
int64_t size = 1073741824/4;
int repetition = 20000;
int32_t* data = (int32_t*) malloc(sizeof(int32_t)*size);
int i = 0;
while (i < size) {
int32_t number = rand();
//repeat j times
for (int j = 0; j < repetition; j++) {
if (i < size) {
data[i] = number;
i++;
}
}
}
FILE *fp = fopen("data", "w");
fwrite(data, 4, size, fp);
//for(int x = 0; x<size; x++){
//fprintf(fp, "%d", data[x]);
//}
fclose(fp);
printf("success");
return 0;
}
I think I can safely assume that this is not the intended gain on a file of this or any size as gzip would win about 50% of the file size in about the same amount of time.
It saddens me to say that I think that I will have to admit defeat here and pick a different project. I will soon update on that.
maandag 21 maart 2016
SPO600 Compiler Optimizations
As we are starting projects shortly, this week there is no lab post. Instead we take a look at some gcc optimization flags. Everyone in the course was tasked with selecting two gcc compiler optimization flags and figuring out what they do.
My first choice is:
floop-strip-mine
DISCLAIMER:
We can not run the -floop-strip-mine flag right now because it requires a specific library to be built in to gcc at build time. This library is not present in most readily available gcc builds(including the one you probably have installed and the one I have installed). I have tried building gcc myself, but the process is fairly complicated and I could not get it to work. This means that this blog I will be showing a manual example of what some sample code would look like before and after translation, but there will be no assembly demonstrating the power of the actual flag.
the f before the options stands for flag, you can turn them on by adding the function as an argument to a compilation, for example:
floop-strip-mine is a flag that turns on the strip-mining of loops. To see how that works, here is a pseudocode example:
Lets say we have two two-dimensional arrays, array a and b, that both have 8 rows and 8 columns.
Let us assume they are filled with values, what those values are doesn't really matter right now. We are going to loop through the arrays and do a multiplication, storing the result in a two-dimensional array c.
The values in a and b are stored in main memory, if we pull them out of main memory into the cache, let's see what happens. Let's assume our cache-lines can store 4 integer values at a time.
The first iteration:
i = 0;
j = 0;
elements we need: a[0][0], b[0][0].
lines going from main memory into cache:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]
The second iteration:
i = 0;
j = 1;
elements we need: a[0][1], b[1][0]
Hey that is nice! a[0][1] is already in cache so we can reuse that cache line!
lines going from main memory into cache:
line 3: b[1][0], b[1][1], b[1][2], b[1][3]
Sadly we had to pull in a new line for b, as the value was not in cache yet.
The third iteration:
i = 0;
j = 2;
elements we need: a[0][2], b[2][0]
As in the second iteration the a-line is already in cache, so we can reuse it, the b line however we have to pull in:
line 4: b[2][0], b[2][1], b[2][2], b[2][3]
The fourth iteration:
i=0;
j=3;
elements we need: a[0][3], b[3][0]
As in previous iterations, a is in cache and we pull in:
line 5: b[3][0], b[3][1], b[3][2], b[3][3]
The fifth iteration:
i=0;
j=4;
elements we need: a[0][4], b[4][0]
Now we need to also bring in a new line for a, note that we have been using the same line in all the previous iterations though:
line 6: a[0][4], a[0][5], a[0][6], a[0][7]
line 7: b[4][0], b[4][1], b[4][2], b[4][3]
We can sort of see a pattern here: a is nicely lined up in cache, if we proceed after the 4th iteration it will still only take up 2 lines of cache space. The b-array however is misaligned, every new iteration, we need to bring a new cache-line in from main memory! Note that this is still not a problem in a realistic situation where we have more than enough cache-lines to store all the values in b.
To illustrate what strip mining is and why it works let us assume we only have 8 cache lines to store the values of a and b in. Whenever a new cache line is brought in that does not fit in the cache, the oldest unused cache line gets thrown out.
In the next few iterations I am hoping to make clear what goes wrong here:
The sixth iteration:
i=0;
j=5;
elements we need a[0][5], b[5][0]
cache lines to pull in:
line 8: b[5][0], b[5][1], b[5][2], b[5][3]
Note that our cache is now full! We will start replacing the oldest unused line in the next iterations!
The seventh iteration:
i=0;
j=6;
elements we need a[0][6], b[6][0]
We find the oldest unused cache line, which is line 2, since we used line 1 up to the fourth iteration, cache lines to pull in:
replace line 2: b[6][0], b[6][1], b[6][2], b[6][3]
The eight iteration:
i=0;
j=7;
elements we need: a[0][7], b[7][0]
We find the oldest unused cache line, which is line 3 and replace it:
replace line 3: b[7][0], b[7][1], b[7][2], b[7][3]
This was the final iteration where i remained constant, so far we have not encountered any real problems, besides having to replace some cache lines, which is not a big deal yet. Now we look at the first few iterations where i=1;
The ninth iteration:
i=1;
j=0;
elements we need:a[1][0], b[1][1]
First we find the oldest unused cache line, which is line 4, replace that and then we find the second oldest unused cache line, which is line 1 and replace it:
replace line 4: a[1][0], a[1][1], a[1][2], a[1][3]
replace line 1: b[0][1], b[0][2], b[0][3], b[0][4]
Note here that we used to have the b-value b[0][1] in cache already, but we had to replace it with another value because our cache was full! This means we had to pull the same value from memory into cache twice!
The tenth iteration:
i=1;
j=1;
elements we need: a[1][1], b[1][1]
Again, we see the recurring pattern, we already have a[1][1] in the cache in line 4, but we have to replace a cache line to get the b-value in cache. We already had the b-value b[1][1] in cache in the second iteration, but we were forced to replace that line because our cache was full!
The pattern we are seeing here is that because the access of the b-array is structured as b[j][i] instead of b[i][j], we are running into trouble fitting the values into cache properly and creating cache-misses by replacing lines with values we still need later on!
Normally what you would do in a situation like this where the reverse access b[j][i] is causing the cache-misses is called loop-interchange. You can reverse the order of the loops so b[j][i] essentially becomes b[i][j]. However we can not do that in this case, because a[i][j] is also in this loop! Reversing the loops would create the exact same problem, only on the other side, after swapping the loops around we would have a lot of cache-misses on the a-array!
This is where strip mining comes in!
We take the code we had written for our original program and turn it into this:
Here is what the iterations in this loop look like:
1st:
x=0, y=0, i=0; j=0
cache lines:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]
2nd:
x=0, y=0, i=0; j=1
cache lines:
line 3: b[1][0],b[1][1],b[1][2],b[1][3]
3d:
x=0, y=0, i=0; j=2
cache lines:
line 4: b[2][0],b[2][1],b[2][2],b[2][3]
4th:
x=0, y=0, i=0; j=3
cache lines:
line 5: b[3][0],b[3][1],b[3][2],b[3][3]
5th:
j is no longer smaller than 4 (y+4) so instead we go up one loop and increase i!
x=0, y=0, i=1, j=0
cache lines:
line 6: a[1][0], a[1][1], a[1][2], a[1][3]
Note that here we still have the required b cache line!
6th:
x=0, y=0, i=1,j=1
cache lines:
No cache lines required! We have line 6 which contains a[1][1] and line 3 which contains b[1][1]!
7th:
x=0, y=0, i=1,j=2
cache lines:
No cache lines required! We have line 6 which contains a[1][2] and line 4 which contains b[2][1]!
Actually if you keep on iterating the loop, you can see that for the first 16 iterations, all we need is 8 cache lines! The lines we are missing now are a[2][0], a[2][1], a[2][2], a[2][3] and a[3][0], a[3][1], a[3][2], a[3][3], but that is it for the first 16 iterations! In fact, all groups of 16 iterations in the cycle only require 8 cache lines to store that particular block!
If we look at the arrays in a visual display, we are actually structuring the loops so it does these blocks of 4 by 4 first, before moving on to the next block! The reason for this is that these blocks happen to fit perfectly in our limited cache size.
In the real world obviously we have larger cache lines and more of them, but this concept applies in exactly the same way to arrays that are much larger than this one. This means that if you are working with very large two-dimensional arrays and traversing those arrays in a manner that is sub-optimal, strip-mining may be the solution for you!
Another time when strip-mining can be very valuable is if you are dealing with a multi-threading situation. Often when multi-threading, you will divide the data up into chunks that will be given to a specific processor. All the processors then run the same program, but on different pieces of data. This is a common way of managing workloads between multiple processors.
A problem that arises from this is cache-coherency between processors. If one processor has a cache line that contains a value that another processor has inside it's cache. and it updates a value on that same line. That whole line will need to be updated in the second processor, even though the second processor may not even need to use the specific value that was updated.
So if processor 1 was working on a[0][1], a[0][2] and processor 2 was working on a[0][3], a[0][4],
the cache-line that processor 1 is working on accidentally also contains a[0][3] and a[0][4] because we can only pull in entire cache-lines at a time. This means that if processor 2 updates a[0][3], processor 1 will require an update, even though it is not even actively working on a[0][3]! It just happened to be on the same cache-line. There is a lot of unnecessary communication between processors this way.
To prevent this unwarranted communication we strip-mine the loop and instead deal out the blocks of 16 values to different processors. This way there is no overlap in the cache-lines and each processor stays out of the other processors way. If you wish to know more about this problem, which is a very common problem in multi-processing, google for false-sharing.
So to recap:
We strip-mine because:
a. If our cache size is small, it prevents a lot of cache-misses.
b. If we are multi-threading, it prevents a lot of communication overhead caused by false sharing.
I am a little bit sad that I could not present you with a view of what gcc does with our code in terms of assembly, I will look into obtaining a gcc build with the necessary libraries and post some more on that if I find one.
My first choice is:
floop-strip-mine
DISCLAIMER:
We can not run the -floop-strip-mine flag right now because it requires a specific library to be built in to gcc at build time. This library is not present in most readily available gcc builds(including the one you probably have installed and the one I have installed). I have tried building gcc myself, but the process is fairly complicated and I could not get it to work. This means that this blog I will be showing a manual example of what some sample code would look like before and after translation, but there will be no assembly demonstrating the power of the actual flag.
the f before the options stands for flag, you can turn them on by adding the function as an argument to a compilation, for example:
gcc -floop-strip-mine program.CSome optimizations are always on, some options, such as -O2, -O3 turn on entire groups of optimization flags. To turn off a specific flag use the -no prefix. for example:
gcc -no-floop-strip-mine program.CThe reason I chose floop-strip-mine is that our projects could potentially benefit a lot from multi-threading if they do not take advantage of that yet. floop-strip-mine is an optimization that is especially beneficial in a multi-threaded situation.
floop-strip-mine is a flag that turns on the strip-mining of loops. To see how that works, here is a pseudocode example:
Lets say we have two two-dimensional arrays, array a and b, that both have 8 rows and 8 columns.
Let us assume they are filled with values, what those values are doesn't really matter right now. We are going to loop through the arrays and do a multiplication, storing the result in a two-dimensional array c.
int[8][8] a, b, c;
for(int i=0; i<8; i++){
for(int j=0;j<8;j++){
c[i][j] = a[i][j] * b[j][i];
}
}
The values in a and b are stored in main memory, if we pull them out of main memory into the cache, let's see what happens. Let's assume our cache-lines can store 4 integer values at a time.
The first iteration:
i = 0;
j = 0;
elements we need: a[0][0], b[0][0].
lines going from main memory into cache:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]
The second iteration:
i = 0;
j = 1;
elements we need: a[0][1], b[1][0]
Hey that is nice! a[0][1] is already in cache so we can reuse that cache line!
lines going from main memory into cache:
line 3: b[1][0], b[1][1], b[1][2], b[1][3]
Sadly we had to pull in a new line for b, as the value was not in cache yet.
The third iteration:
i = 0;
j = 2;
elements we need: a[0][2], b[2][0]
As in the second iteration the a-line is already in cache, so we can reuse it, the b line however we have to pull in:
line 4: b[2][0], b[2][1], b[2][2], b[2][3]
The fourth iteration:
i=0;
j=3;
elements we need: a[0][3], b[3][0]
As in previous iterations, a is in cache and we pull in:
line 5: b[3][0], b[3][1], b[3][2], b[3][3]
The fifth iteration:
i=0;
j=4;
elements we need: a[0][4], b[4][0]
Now we need to also bring in a new line for a, note that we have been using the same line in all the previous iterations though:
line 6: a[0][4], a[0][5], a[0][6], a[0][7]
line 7: b[4][0], b[4][1], b[4][2], b[4][3]
We can sort of see a pattern here: a is nicely lined up in cache, if we proceed after the 4th iteration it will still only take up 2 lines of cache space. The b-array however is misaligned, every new iteration, we need to bring a new cache-line in from main memory! Note that this is still not a problem in a realistic situation where we have more than enough cache-lines to store all the values in b.
To illustrate what strip mining is and why it works let us assume we only have 8 cache lines to store the values of a and b in. Whenever a new cache line is brought in that does not fit in the cache, the oldest unused cache line gets thrown out.
In the next few iterations I am hoping to make clear what goes wrong here:
The sixth iteration:
i=0;
j=5;
elements we need a[0][5], b[5][0]
cache lines to pull in:
line 8: b[5][0], b[5][1], b[5][2], b[5][3]
Note that our cache is now full! We will start replacing the oldest unused line in the next iterations!
The seventh iteration:
i=0;
j=6;
elements we need a[0][6], b[6][0]
We find the oldest unused cache line, which is line 2, since we used line 1 up to the fourth iteration, cache lines to pull in:
replace line 2: b[6][0], b[6][1], b[6][2], b[6][3]
The eight iteration:
i=0;
j=7;
elements we need: a[0][7], b[7][0]
We find the oldest unused cache line, which is line 3 and replace it:
replace line 3: b[7][0], b[7][1], b[7][2], b[7][3]
This was the final iteration where i remained constant, so far we have not encountered any real problems, besides having to replace some cache lines, which is not a big deal yet. Now we look at the first few iterations where i=1;
The ninth iteration:
i=1;
j=0;
elements we need:a[1][0], b[1][1]
First we find the oldest unused cache line, which is line 4, replace that and then we find the second oldest unused cache line, which is line 1 and replace it:
replace line 4: a[1][0], a[1][1], a[1][2], a[1][3]
replace line 1: b[0][1], b[0][2], b[0][3], b[0][4]
Note here that we used to have the b-value b[0][1] in cache already, but we had to replace it with another value because our cache was full! This means we had to pull the same value from memory into cache twice!
The tenth iteration:
i=1;
j=1;
elements we need: a[1][1], b[1][1]
Again, we see the recurring pattern, we already have a[1][1] in the cache in line 4, but we have to replace a cache line to get the b-value in cache. We already had the b-value b[1][1] in cache in the second iteration, but we were forced to replace that line because our cache was full!
The pattern we are seeing here is that because the access of the b-array is structured as b[j][i] instead of b[i][j], we are running into trouble fitting the values into cache properly and creating cache-misses by replacing lines with values we still need later on!
Normally what you would do in a situation like this where the reverse access b[j][i] is causing the cache-misses is called loop-interchange. You can reverse the order of the loops so b[j][i] essentially becomes b[i][j]. However we can not do that in this case, because a[i][j] is also in this loop! Reversing the loops would create the exact same problem, only on the other side, after swapping the loops around we would have a lot of cache-misses on the a-array!
This is where strip mining comes in!
We take the code we had written for our original program and turn it into this:
int[8][8] a, b, c;Now I admit that if you look at this the first time, programmatically it does not make a lot of sense. You are introducing two extra for loops and the loops look strange! However from a cache-viewpoint, this makes a lot of sense.
for(int x=0; x<8; x+=4){
for(int y=0; y<8; y+=4){
for(int i=0; i<min(x+4, 8), i++){
for(int j=0; j<min(y+4, 8), j++){
c[i][j] = a[i][j]*b[j][i];
}
}
}
}
Here is what the iterations in this loop look like:
1st:
x=0, y=0, i=0; j=0
cache lines:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]
2nd:
x=0, y=0, i=0; j=1
cache lines:
line 3: b[1][0],b[1][1],b[1][2],b[1][3]
3d:
x=0, y=0, i=0; j=2
cache lines:
line 4: b[2][0],b[2][1],b[2][2],b[2][3]
4th:
x=0, y=0, i=0; j=3
cache lines:
line 5: b[3][0],b[3][1],b[3][2],b[3][3]
5th:
j is no longer smaller than 4 (y+4) so instead we go up one loop and increase i!
x=0, y=0, i=1, j=0
cache lines:
line 6: a[1][0], a[1][1], a[1][2], a[1][3]
Note that here we still have the required b cache line!
6th:
x=0, y=0, i=1,j=1
cache lines:
No cache lines required! We have line 6 which contains a[1][1] and line 3 which contains b[1][1]!
7th:
x=0, y=0, i=1,j=2
cache lines:
No cache lines required! We have line 6 which contains a[1][2] and line 4 which contains b[2][1]!
Actually if you keep on iterating the loop, you can see that for the first 16 iterations, all we need is 8 cache lines! The lines we are missing now are a[2][0], a[2][1], a[2][2], a[2][3] and a[3][0], a[3][1], a[3][2], a[3][3], but that is it for the first 16 iterations! In fact, all groups of 16 iterations in the cycle only require 8 cache lines to store that particular block!
If we look at the arrays in a visual display, we are actually structuring the loops so it does these blocks of 4 by 4 first, before moving on to the next block! The reason for this is that these blocks happen to fit perfectly in our limited cache size.
In the real world obviously we have larger cache lines and more of them, but this concept applies in exactly the same way to arrays that are much larger than this one. This means that if you are working with very large two-dimensional arrays and traversing those arrays in a manner that is sub-optimal, strip-mining may be the solution for you!
Another time when strip-mining can be very valuable is if you are dealing with a multi-threading situation. Often when multi-threading, you will divide the data up into chunks that will be given to a specific processor. All the processors then run the same program, but on different pieces of data. This is a common way of managing workloads between multiple processors.
A problem that arises from this is cache-coherency between processors. If one processor has a cache line that contains a value that another processor has inside it's cache. and it updates a value on that same line. That whole line will need to be updated in the second processor, even though the second processor may not even need to use the specific value that was updated.
So if processor 1 was working on a[0][1], a[0][2] and processor 2 was working on a[0][3], a[0][4],
the cache-line that processor 1 is working on accidentally also contains a[0][3] and a[0][4] because we can only pull in entire cache-lines at a time. This means that if processor 2 updates a[0][3], processor 1 will require an update, even though it is not even actively working on a[0][3]! It just happened to be on the same cache-line. There is a lot of unnecessary communication between processors this way.
To prevent this unwarranted communication we strip-mine the loop and instead deal out the blocks of 16 values to different processors. This way there is no overlap in the cache-lines and each processor stays out of the other processors way. If you wish to know more about this problem, which is a very common problem in multi-processing, google for false-sharing.
So to recap:
We strip-mine because:
a. If our cache size is small, it prevents a lot of cache-misses.
b. If we are multi-threading, it prevents a lot of communication overhead caused by false sharing.
I am a little bit sad that I could not present you with a view of what gcc does with our code in terms of assembly, I will look into obtaining a gcc build with the necessary libraries and post some more on that if I find one.
zondag 28 februari 2016
SPO600 Lab 7
In this lab we are going to look at some software packages that use inline assembler, we are going to look at why it uses it, whether that use is correct and what platforms it is present for.
I will be looking at libmad or MAD, which is an MPEG Audio Decoding package. If you wish to follow along, the source code package is available on sourceforge at https://sourceforge.net/projects/mad/files/libmad/, as in demonstrated in lab 1, we can just extract the tar ball by using the tar -xvf [filename] command. We don't have to make the program to look at the source code. First off I will be using egrep -R "asm | __(asm)__*\(" to recursively search downwards in the map for which programs inside the package contain assembler.
This gives us 3 results, 2 of which are actual inline use of asssembler.
intl/libgnuintl.h.in and audio_jaguar.c
First I will be looking at the intl/libgnuintl.h.in to see if I can figure out what the instruction is there for, the instruction we are looking at is:
On to the audio_jaguar.c where it doesn't concern just the renaming of a function, here we find the following function:
That was it for this library, there was not that much assembly to be found in here. The assembly that was there, was written to increase portability to systems that have naming issues or lack the IEEE 754 standard. I would have liked to have seen an alternative in C that makes sure the latter function does not break on any other platform than PowerPC, but it is probably assumed that any modern system supports the IEEE 754 standard.
I will be looking at libmad or MAD, which is an MPEG Audio Decoding package. If you wish to follow along, the source code package is available on sourceforge at https://sourceforge.net/projects/mad/files/libmad/, as in demonstrated in lab 1, we can just extract the tar ball by using the tar -xvf [filename] command. We don't have to make the program to look at the source code. First off I will be using egrep -R "asm | __(asm)__*\(" to recursively search downwards in the map for which programs inside the package contain assembler.
This gives us 3 results, 2 of which are actual inline use of asssembler.
intl/libgnuintl.h.in and audio_jaguar.c
First I will be looking at the intl/libgnuintl.h.in to see if I can figure out what the instruction is there for, the instruction we are looking at is:
#ifdef _INTL_REDIRECT_ASMAccording to the comments this sets up an alternative prefix to the usual command, to be used if the libintl_ library does not work. The reason this is here, is that due to duplicate names in different libraries glibc 2.0 and newer and Solaris 2.4 overwrite a function used in the libintl_ library. I can only assume that this is a valid reason to rename it.
# define _INTL_ASM(cname) __asm__ (_INTL_ASMNAME (__USER_LABEL_PREFIX__, #cname))
On to the audio_jaguar.c where it doesn't concern just the renaming of a function, here we find the following function:
/*Here the assembly instruction is used to count the leading zero's in a register, used to convert a floating point value to the IEEE Standard 754(a fixed point representation). Note that the code offers no alternatives and is only valid on PowerPC. I think this is because most modern processors support this IEEE standard, therefore they do not need a conversion function. In my opinion it is a little bit sloppy to allow the function to break if ran on any other processor, interestingly enough this code was written to increase the portability of the library. The standard was absent on this specific platform so they wrote an alternative implementation.
* NAME: float32()
* DESCRIPTION: store 32-bit IEEE Standard 754 floating point representation
*/
static
void float32(void **ptr, mad_fixed_t sample)
{
unsigned long **mem = (void *) ptr, ieee = 0;
if (sample) {
unsigned int zc;
unsigned long abs, s, e, f;
enum {
BIAS = 0x7f
};
/* |seeeeeee|efffffff|ffffffff|ffffffff| */
s = sample & 0x80000000UL;
abs = s ? -sample : sample;
/* PPC count leading zeros */
asm ("cntlzw %0,%1" : "=r" (zc) : "r" (abs));
/* calculate exponent */
e = (((32 - MAD_F_FRACBITS - 1) - zc + BIAS) << 23) & 0x7f800000UL;
/* normalize 1.f - round? */
f = ((abs << zc) >> 8) & 0x007fffffUL;
ieee = s | e | f;
}
That was it for this library, there was not that much assembly to be found in here. The assembly that was there, was written to increase portability to systems that have naming issues or lack the IEEE 754 standard. I would have liked to have seen an alternative in C that makes sure the latter function does not break on any other platform than PowerPC, but it is probably assumed that any modern system supports the IEEE 754 standard.
SPO600 Lab 6
For this lab we are writing a simple program to see if we can let the gcc compiler auto vectorization functionality kick in. We have seen in class that auto vectorization does not kick in at optimization levels lower than level three. The program is going to fill two 1000 integer arrays with random numbers, then sum them into a third array. Lastly it is going to add up all the numbers in the third array into a long integer.
Our code looks as follows:
I realize the three loops could have been one loop, but the specific wording in the assignment suggested the operations would have to be done in a sequential order so I implemented it this way on purpose. I suspect that it may also be easier to recognized as vectorizable by the compiler this way, but I will have to test some more on that later.
Note the assignment did not ask for printing the result, but I was worried the compiler might get rid of my code if I didn't do anything with the result and it is nice to have some confirmation that the program worked.
We will compile this code on the arch64 system we used in the previous lab called aarchie in order to have a look at some arch64 vectorization assembly language.
The compiler command we run is:
I am going to look at our code using the objdump -d command, note that because we specified -O3 we gave the compiler permission to toss around our code making the assembly substantially harder to read. Because our code is so garbled up I am going to look for some excerpts showing that auto vectorization kicked in.
The first vectorization I found is:
Here we are loading part of our first and second array in at the top, and using an incremental immediate offset of 16 to jump past the section we just did in the next iteration. (loading and incrementing the offset in one instruction).
We then call the add on the vector registers, storing the result in vector register 0.
The second part we expected to be vectorized was the summation of the third array, that is found in this section:
Something interesting to note here is that I am not sure why the compiler is lengthening the values to 64 bit signed integer, when a long should require only 32 bits. The compiler is also not making use of the vector reduce operation which would be the logical operation to use in this situation. That last part may be explained by the choice for 64 bit integers, because reducing 2 values is probably less efficient than vector addition. Furthermore, the compiler did all this vectorization without us providing it any hints so far, I will try to see if I can fix the second part with some hints in the code later on. I am glad that it was reasonably easy to find the vector instructions even with the O3 optimization level.
Lastly we were asked to look for a way to vectorize our sound volume program from last lab. Vector registers don't allow for operations across multiple datatypes so we have to think of something to convert our 32 bit float.
In the arm overview I found the following instruction:
We can then promote our 16 bit signed integer samples to 32 bit signed integers using the vectorized lengthening instructions:
That's it for now! I will revisit some of this later due to multiple things I am not 100% sure about at the moment.
Our code looks as follows:
#include <stdlib.h>A couple of notes about this code:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
int main() {
srand(time(NULL));
int arraySize = 1000;
int16_t* a = (int16_t*)malloc(sizeof(int16_t*) * arraySize);
int16_t* b = (int16_t*)malloc(sizeof(int16_t*) * arraySize);
int16_t* c = (int16_t*)malloc(sizeof(int16_t*) * arraySize);
for (int i = 0; i < arraySize; i++) {
a[i] = rand();
b[i] = rand();
}
for (int j = 0; j < arraySize; j++) {
c[j] = a[j] + b[j];
}
long sum = 0;
for (int k = 0; k < arraySize; k++) {
sum += c[k];
}
printf("%d\n", sum);
free(a);
free(b);
free(c);
return 0;
}
I realize the three loops could have been one loop, but the specific wording in the assignment suggested the operations would have to be done in a sequential order so I implemented it this way on purpose. I suspect that it may also be easier to recognized as vectorizable by the compiler this way, but I will have to test some more on that later.
Note the assignment did not ask for printing the result, but I was worried the compiler might get rid of my code if I didn't do anything with the result and it is nice to have some confirmation that the program worked.
We will compile this code on the arch64 system we used in the previous lab called aarchie in order to have a look at some arch64 vectorization assembly language.
The compiler command we run is:
gcc -x c vectorization.C -o vectorization -O3 -std=c99 -gThe -std=99 option is necessary on aarchie because it has the 4.8.3 version of the gcc compiler and c89 is still default there. -O3 is to turn the optimizations we are looking for on. The -x c is to specify we want it compiled in C, for some reason on this code gcc was assuming I wanted to compile it as c++ and giving me back error messages, this solved that problem. -o is to specify the output filename. -g is to specify debug information.
I am going to look at our code using the objdump -d command, note that because we specified -O3 we gave the compiler permission to toss around our code making the assembly substantially harder to read. Because our code is so garbled up I am going to look for some excerpts showing that auto vectorization kicked in.
The first vectorization I found is:
40071c: 4cdf7441 ld1 {v1.8h}, [x2], #16
400720: 4cdf7460 ld1 {v0.8h}, [x3], #16
Here we are loading part of our first and second array in at the top, and using an incremental immediate offset of 16 to jump past the section we just did in the next iteration. (loading and incrementing the offset in one instruction).
400724: 4e608420 add v0.8h, v1.8h, v0.8h
We then call the add on the vector registers, storing the result in vector register 0.
400728: 11000400 add w0, w0, #0x1This add instruction is for the loop control variable. The compiler chose to do this before storing the result so the processor gets some time to finish this operation before calling compare.
40072c: 4c9f7420 st1 {v0.8h}, [x1], #16Here we are storing our result into a memory location pointed to by x1, also using an incremental immediate offset.
400730: 6b04001f cmp w0, w4The compare and branch back to the top if we aren't finished yet. Note that the optional incremental immediate offsets are saving us a lot of instructions otherwise spent in memory management here.
400734: 54ffff43 b.cc 40071c <main+0x13c>
The second part we expected to be vectorized was the summation of the third array, that is found in this section:
4008b8: 4cdf7420 ld1 {v0.8h}, [x1], #16Here we load the array into vector register 0, note it is still in the adress pointed to by x1 where it was stored in the previous section, we are also still using the incremental immediate offset.
4008bc: 0f10a401 sxtl v1.4s, v0.4hAbove we see a somewhat hard to read mixture of additions and lengthening instructions. The add and compare instructions on the regular registers are the loop control variables. I think the reason the addition and lengthening instructions are so intermixed is because the compiler is reusing the original vector register(v0) that already contains the right values as much as possible. This way there is no need to move the originally loaded values around as much.
4008c0: 0f20a423 sxtl v3.2d, v1.2s
4008c4: 4f10a400 sxtl2 v0.4s, v0.8h
4008c8: 4f20a421 sxtl2 v1.2d, v1.4s
4008cc: 4ee28462 add v2.2d, v3.2d, v2.2d
4008d0: 4ee28422 add v2.2d, v1.2d, v2.2d
4008d4: 11000400 add w0, w0, #0x1
4008d8: 0f20a401 sxtl v1.2d, v0.2s
4008dc: 4ee28422 add v2.2d, v1.2d, v2.2d
4008e0: 4f20a400 sxtl2 v0.2d, v0.4s
4008e4: 6b00007f cmp w3, w0
4008e8: 4ee28402 add v2.2d, v0.2d, v2.2d
4008ec: 54fffe68 b.hi 4008b8 <main+0x2d8>
Something interesting to note here is that I am not sure why the compiler is lengthening the values to 64 bit signed integer, when a long should require only 32 bits. The compiler is also not making use of the vector reduce operation which would be the logical operation to use in this situation. That last part may be explained by the choice for 64 bit integers, because reducing 2 values is probably less efficient than vector addition. Furthermore, the compiler did all this vectorization without us providing it any hints so far, I will try to see if I can fix the second part with some hints in the code later on. I am glad that it was reasonably easy to find the vector instructions even with the O3 optimization level.
Lastly we were asked to look for a way to vectorize our sound volume program from last lab. Vector registers don't allow for operations across multiple datatypes so we have to think of something to convert our 32 bit float.
In the arm overview I found the following instruction:
FCVTxS Vd.<T>, Vn.<T>This allows us to use vectorization to convert our 32 bit floats into 32 bit signed integers.
We can then promote our 16 bit signed integer samples to 32 bit signed integers using the vectorized lengthening instructions:
Because the data are now the same type and length we can now use the following instruction to multiply the two:SXTL Vd.<Td>, Vn.<Ts>SXTL2 Vd.<Td>, Vn.<Ts>
FMUL Vd.<T>, Vn.<T>, Vm.<Ts>[index]I believe these instructions should lead to a valid vectorized sound multiplying program. Whether it is the most optimal solution I am not sure and I will have to look at this some more later.
That's it for now! I will revisit some of this later due to multiple things I am not 100% sure about at the moment.
maandag 22 februari 2016
SPO600 Lab 5
For this lab we will be looking at 2 different approaches to scaling 16 bit signed integer sound samples with a float to simulate different volumes. We will also be testing these approaches on 2 different machine architectures, one aarch64 and one x86_64 machine.
The first approach will be to naively scale the value and return the result. When scaling a sample size of n this will result in n multiplication instructions.
The second approach is to create a look up table where we store the result of every possible sample at it's index. We populate the look up table whenever the user changes the scaling factor, but during consecutive calls to the function with no change we can just look up the value in the table. With a 16 bit signed integer this results in 65536 multiplication instructions per volume change. Because the cache will probably be too small to hold the entire array, so this approach will be more memory intensive.
To achieve a clear difference we will be attempting to scale 500 million sound samples, this is roughly equal to 1 hour and 35 minutes of samples in 44 kHz stereo sound.
To estimate the time used by the scaling functions we will be using a control program that does the same memory allocation and basic operations as the program would do without the scaling functions. We will be timing these three programs to see which one does best.
The control program looks like this:
The reason we are using the ^ operator and printing a result is to make sure the compiler does not strip our code thinking that it is useless. We force it to use the code in some form of output so it can't get rid of it.
The reason we are multiplying rand() by 2 is that rand() has a maximum value of approximately 32000, we need to represent around 65000 numbers so this is our way of artificially doing so. Note that this creates only even numbers and it still only creates around 32000 distinct values. For our scaling function this does not matter as we just care about having a lot of samples.
For the 2 implementations we add this method signature above:
Lastly we run a comparison running the three programs on both our platforms, we will be using the time command and the real duration in seconds for this comparison:
Some things to note here:
With these things in mind I think we can still say a few things about the results, due to the size of the gap between the solutions. Even without changing the volume, the x86_64 version performs better when naively scaling the sound samples. This probably has to do with memory access speed being the bottleneck in this case. On the aarch64 system however the lookup-table performs much better than the naive scaling. There is a full minute difference, even if you account for us not changing the volume this is a significant gap. This also means that the processing power is the bottleneck in this system.
In both cases we could easily scale the amount of sound we had in a fraction of the time required to listen to it, so there would be no problem with either approach, however if you are planning to scale a huge amount of sound samples this test was definitely worth the effort.
Next lab we will be looking at parallelising some of these operations to see if we can achieve an even faster result!
The first approach will be to naively scale the value and return the result. When scaling a sample size of n this will result in n multiplication instructions.
The second approach is to create a look up table where we store the result of every possible sample at it's index. We populate the look up table whenever the user changes the scaling factor, but during consecutive calls to the function with no change we can just look up the value in the table. With a 16 bit signed integer this results in 65536 multiplication instructions per volume change. Because the cache will probably be too small to hold the entire array, so this approach will be more memory intensive.
To achieve a clear difference we will be attempting to scale 500 million sound samples, this is roughly equal to 1 hour and 35 minutes of samples in 44 kHz stereo sound.
To estimate the time used by the scaling functions we will be using a control program that does the same memory allocation and basic operations as the program would do without the scaling functions. We will be timing these three programs to see which one does best.
The control program looks like this:
#include <stdio.h>some things to note about this code:
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>
void fillSampleArray(int16_t* arr, int numOfElems);
int main()
{
//creating an array of samples
srand((int)time(NULL));
int numberOfElements = 500000000;
int16_t* sampleArray = (int16_t*)malloc(sizeof(int16_t) * numberOfElements);
fillSampleArray(sampleArray, numberOfElements);
int result = INT16_MIN;
for (int i = 0; i < numberOfElements; i++) {
result = result ^ sampleArray[i];
}
printf("%d", result);
free(sampleArray);
return 0;
}
void fillSampleArray(int16_t* arr, int numOfElems)
{
for (int i = 0; i < numOfElems; i++)
{
arr[i] = ((rand() * 2) % (UINT16_MAX + 1)) + (INT16_MIN);
}
}
The reason we are using the ^ operator and printing a result is to make sure the compiler does not strip our code thinking that it is useless. We force it to use the code in some form of output so it can't get rid of it.
The reason we are multiplying rand() by 2 is that rand() has a maximum value of approximately 32000, we need to represent around 65000 numbers so this is our way of artificially doing so. Note that this creates only even numbers and it still only creates around 32000 distinct values. For our scaling function this does not matter as we just care about having a lot of samples.
For the 2 implementations we add this method signature above:
int16_t volumeScale(int16_t sample, float scale);We have a small change to the main program to use the function, the result is now calculated as follows:
result = result ^ volumeScale(sampleArray[i], 0.750);Our naive solution is fairly simple and looks as you would expect:
int16_t volumeScale(int16_t sample, float scale)The lookup function is a bit more complicated due to the need to save state across multiple iterations and allocation of the array to store the intermediate results.
{
return (int16_t)sample*scale;
}
int16_t lookUpVolumeScale(int16_t sample, float scale)Funny thing here, I am a C# / Java programmer who has never programmed anything in C besides a flashing led, so I did not actually know how to save state inside a function across multiple iterations. It took me 2 hours of googling to figure out that the static keyword does all this for you. Sometimes when you are searching for something considered to be very simple for a C programmer, it can actually be very hard to find a result that explains it.
{
//previous volume
static float pVolume = -1;
//lookup table
static int16_t* luTable=NULL;
if(luTable==NULL){
luTable = (int16_t*)malloc(sizeof(int16_t) * (UINT16_MAX + 1));
}
if(pVolume != scale)
{
for(int i=0; i<UINT16_MAX+1; i++)
{
//INT16_MIN is the offset
luTable[i] = scale*(i+INT16_MIN);
}
pVolume = scale;
}
return luTable[sample + INT16_MAX + 1];
}
Lastly we run a comparison running the three programs on both our platforms, we will be using the time command and the real duration in seconds for this comparison:
Platform | Control | Naive | Lookup |
---|---|---|---|
x86_64(Xerxes) | 10.728 | 12.354 | 19.238 |
aarch64(Aarchie) | 181.366 | 286.152 | 223.343 |
Some things to note here:
- We have not changed the volume, this means that we are assuming our user is going to listen to 1,5 hours of music without ever changing the volume.
- The aarch64 system is compiling with a slightly older version of gcc.
- The time function is probably not the best approach to accurately time these programs.
With these things in mind I think we can still say a few things about the results, due to the size of the gap between the solutions. Even without changing the volume, the x86_64 version performs better when naively scaling the sound samples. This probably has to do with memory access speed being the bottleneck in this case. On the aarch64 system however the lookup-table performs much better than the naive scaling. There is a full minute difference, even if you account for us not changing the volume this is a significant gap. This also means that the processing power is the bottleneck in this system.
In both cases we could easily scale the amount of sound we had in a fraction of the time required to listen to it, so there would be no problem with either approach, however if you are planning to scale a huge amount of sound samples this test was definitely worth the effort.
Next lab we will be looking at parallelising some of these operations to see if we can achieve an even faster result!
zondag 31 januari 2016
SPO600 Lab 4
In our fourth lab we looked at some of the different compiler options and the differences it made in the assembly output. In order to look at the differences we wrote a simple hello world program that was compiled with the different options.
I will go by the options we went over one by one below.
0. Original.
The original command we used was gcc -g -O0 -fno-builtin.
This version is used to compile a file with debugging information, no optimization and without use of builtin functionality.
1. Static.
Adding -static means the compiler pulls in the entire library the function you are using is in. In this case it pulled in the library that contains the printf function we used to print our message. This results in about a 10 times increase in file size in this case. The advantage of using static is that no matter what system you go to, you can guarantee that the functions you are using will be present, since you include the entire library. The disadvantage is the strain on the file size and if you use static a lot, a lot of your programs will have the same libraries, but with different versions. It will be hard to keep track of security flaws and versioning.
2. Not using fno-builtin.
fno-builtin means you are asking the compiler not to use the builtin functionality. In this case the compiler recognizes it has a more efficient way to do the printf function because it is only using 1 argument. The compiler realizes that it can use the builtin function puts instead for more efficiency.
3. Removing -g .
By removing -g from the command we are compiling the file in without debug information, this means the file becomes smaller, but also harder to read. the --source command no longer shows you where the compiler put what part of your code.
4. Adding additional arguments to the printf.
On aarch64 after assigning 7 arguments to different registers, the rest of the arguments get put on the stack through r1. On x86_64 there were only 5 arguments assigned to the registers before putting the rest on the stack. This makes sense because x86_64 has less registers to work with. Interestingly enough, on the aarch64 side the assembly made use of store instead of put to add the arguments to the stack.
5. Moving printf to a separate function called output and calling that from the main.
With this one I thought the compiler would recognise that it is just calling printf and the function did not really have any use, but I was wrong. The assembly called output from the main and in the output was the same section that was previously in the main. If you use arguments to pass the string it also stores some overhead on the stack.
6. Replace O0 with O3.
This instruction asks the compiler to use a lot of optimization(severe), it will actually move some of your code around to do so, so if you compile something with O3 you always have to check whether the functionality of your program remains the same. One of the differences in the output is that instead of returning to the main function after calling the printf function the compiler realized that the printf was the last statement and with an unconditional branch told the program not to bother coming back to the main. Severe optimization also deleted the set up for the stack in this case, because this program makes no use of it.
So this week we saw a small part of what the compiler can do for us and what we can ask it to do. I found it very interesting to finally see what makes it so that any code we write can actually run. I think understanding how the compiler turns our code into assembler can really help us decide what options to use when we want to go for maximum efficiency in our code.
I will go by the options we went over one by one below.
0. Original.
The original command we used was gcc -g -O0 -fno-builtin.
This version is used to compile a file with debugging information, no optimization and without use of builtin functionality.
1. Static.
Adding -static means the compiler pulls in the entire library the function you are using is in. In this case it pulled in the library that contains the printf function we used to print our message. This results in about a 10 times increase in file size in this case. The advantage of using static is that no matter what system you go to, you can guarantee that the functions you are using will be present, since you include the entire library. The disadvantage is the strain on the file size and if you use static a lot, a lot of your programs will have the same libraries, but with different versions. It will be hard to keep track of security flaws and versioning.
2. Not using fno-builtin.
fno-builtin means you are asking the compiler not to use the builtin functionality. In this case the compiler recognizes it has a more efficient way to do the printf function because it is only using 1 argument. The compiler realizes that it can use the builtin function puts instead for more efficiency.
3. Removing -g .
By removing -g from the command we are compiling the file in without debug information, this means the file becomes smaller, but also harder to read. the --source command no longer shows you where the compiler put what part of your code.
4. Adding additional arguments to the printf.
On aarch64 after assigning 7 arguments to different registers, the rest of the arguments get put on the stack through r1. On x86_64 there were only 5 arguments assigned to the registers before putting the rest on the stack. This makes sense because x86_64 has less registers to work with. Interestingly enough, on the aarch64 side the assembly made use of store instead of put to add the arguments to the stack.
5. Moving printf to a separate function called output and calling that from the main.
With this one I thought the compiler would recognise that it is just calling printf and the function did not really have any use, but I was wrong. The assembly called output from the main and in the output was the same section that was previously in the main. If you use arguments to pass the string it also stores some overhead on the stack.
6. Replace O0 with O3.
This instruction asks the compiler to use a lot of optimization(severe), it will actually move some of your code around to do so, so if you compile something with O3 you always have to check whether the functionality of your program remains the same. One of the differences in the output is that instead of returning to the main function after calling the printf function the compiler realized that the printf was the last statement and with an unconditional branch told the program not to bother coming back to the main. Severe optimization also deleted the set up for the stack in this case, because this program makes no use of it.
So this week we saw a small part of what the compiler can do for us and what we can ask it to do. I found it very interesting to finally see what makes it so that any code we write can actually run. I think understanding how the compiler turns our code into assembler can really help us decide what options to use when we want to go for maximum efficiency in our code.
woensdag 27 januari 2016
SPO600 Lab 3
For our third lab we split up in groups to write some assembly, we started by looking at some x86_64 assembler, after which we looked at the differences between that and aarch64 assembler.
Our first assignment was to build a simple loop that loops from 0-9 and displays "Loop: [the current number] \n" of the loop. In order to convert our loop number to an ASCII-character we have to add 48 to the number because this is the offset of characters where the 0 starts in ASCII. We did this by moving 48 into an empty register that doesn't get trampled at the start of the program. Then in the loop we moved the loop number into an empty register, added 48 from the previous register and wrote the result into the rsi register(message location register).
At first we had our string in a .rodata section, this stands for read-only, so when we tried to write into it, nothing happened. We fixed this by removing the ro to spell .data
Next we did not realise that mov moves the full 64 bytes. Even if you only had 1 byte of data.
This caused our number's move to overwrite the newline character at the end of the original string. This is fixed by adding the b suffix to mov and the register you wish to move it from. So it becomes something like: movb %r13b,(msg+6).
On aarch64 moving into the message with an offset was a little bit different, but otherwise the instructions were similar.
Next we changed it so the loop would print from 00-30. This used the div instruction, which takes the rax register and divides it by a chosen register. It stores the result into rax and the remainder into rdx. We divided the loop number by 10, added our offset of 48 to rax and rdx and moved them into the string as we had done in the previous assignment.
One thing to pay attention to: before using the div operation, the rdx register has to be set to 0!
On aarch64 this was similar, but you had to use msub to calculate the remainder.
The last assignment was supressing the first digit when the number was below 10 We did this by checking if the quotient was zero, if it was, we use je(jump if equal) or be(branch if equal) to jump to a label past the mov assignment to move the first digit into the register.
Tips and tricks:
as -g -o filename.o filename.s; ld -o filename filename.o; ./filename compiles and executes a source file in one statement.
Make sure you know whether you are on least significant byte first architecture or most significant byte architecture as it could heavily influence the result of moving a specific byte somewhere.
Copy your files regularly so if you add to a file and you break it, you still have a working copy to try something new on.
Observations about the differences between aarch64 and x86_64 assembler:
- If you need just the remainder on aarch64 you still have to divide, where you get the remainder built in on x86_64
- When moving just a byte on aarch64 you specify a 32 bit-wide register and you work with an offset, whereas on x86_64 you have the option to specify byte sized.
donderdag 14 januari 2016
SPO600 Lab 2
For this weeks lab we will be looking at two Open Source packages from the Free Software Foundation and installing them.
In order to find them I will browse through the GNU project's free software directory at:
http://directory.fsf.org/wiki/GNU
My first choice is GNUstep, it is a graphical editor for Objective-C and it plans to add support for C++ in the future. We will look at the documentation offered through
http://www.gnustep.org/resources/documentation/User/GNUstep/userfaq_toc.html
in order to install the package.
On Fedora, most of the required packages for this installation are included in the "Development tools" group, so it is recommended to install those first.
If you are not on Fedora, the packages can be found in the installation tutorial and each one has a specific use, you can opt-out of most of them if you do not require the functionality so read carefully what each one does.
http://www.gnustep.org/resources/documentation/User/GNUstep/gnustep-howto_2.html#Preliminaries
I tried installing the different packages and running the application, but I accidentally installed it through the install command, thinking it would work as well. I should have compiled it from source instead so my first attempt did not work.
After my first attempt I went to try a simpler program to see if I could get that to install.
I chose http://directory.fsf.org/wiki/Hello which is a modification to the Hello world program everyone writes when they just start programming. It is a demonstrative program meant to educate you. We install it by downloading the source from the provided link. this is a .tar.gz file, which means it is in compressed form.
We can decompress it by using the command
tar -xzf filelocation/filename.tar.gz
After which we install it by executing the following commands:
./configure
make
We can now run our Hello program by typing hello in the appropriate directory.
This is the standard for installing source packages, sometimes the names will be slightly different, for instance ./configuration, view the file names inside the package to see what your specific package requires.
After my success with the Hello program, I downloaded the tar.gz files for GNUstep and used the tar -xvf command to extract them.
After installing the gnustep-make package using ./configure and make, we have to run
After the compilation we can run the program from the commandline. In class I learned that we should not install the files with root privilege, because it could overwrite the packages already present in Fedora. Instead we can just run the programs from the directories in which we installed them.
In order to find them I will browse through the GNU project's free software directory at:
http://directory.fsf.org/wiki/GNU
My first choice is GNUstep, it is a graphical editor for Objective-C and it plans to add support for C++ in the future. We will look at the documentation offered through
http://www.gnustep.org/resources/documentation/User/GNUstep/userfaq_toc.html
in order to install the package.
On Fedora, most of the required packages for this installation are included in the "Development tools" group, so it is recommended to install those first.
If you are not on Fedora, the packages can be found in the installation tutorial and each one has a specific use, you can opt-out of most of them if you do not require the functionality so read carefully what each one does.
http://www.gnustep.org/resources/documentation/User/GNUstep/gnustep-howto_2.html#Preliminaries
I tried installing the different packages and running the application, but I accidentally installed it through the install command, thinking it would work as well. I should have compiled it from source instead so my first attempt did not work.
After my first attempt I went to try a simpler program to see if I could get that to install.
I chose http://directory.fsf.org/wiki/Hello which is a modification to the Hello world program everyone writes when they just start programming. It is a demonstrative program meant to educate you. We install it by downloading the source from the provided link. this is a .tar.gz file, which means it is in compressed form.
We can decompress it by using the command
tar -xzf filelocation/filename.tar.gz
After which we install it by executing the following commands:
./configure
make
We can now run our Hello program by typing hello in the appropriate directory.
This is the standard for installing source packages, sometimes the names will be slightly different, for instance ./configuration, view the file names inside the package to see what your specific package requires.
After my success with the Hello program, I downloaded the tar.gz files for GNUstep and used the tar -xvf command to extract them.
After installing the gnustep-make package using ./configure and make, we have to run
. /usr/local/share/GNUstep/Makefiles/GNUstep.sh
Then we install the other packages the same way we installed the make package.
We are now ready to run GNUstep, before we do though we have to make sure our environment is set up correctly by running the GNUstep.sh file again with the sh command.
After the compilation we can run the program from the commandline. In class I learned that we should not install the files with root privilege, because it could overwrite the packages already present in Fedora. Instead we can just run the programs from the directories in which we installed them.
dinsdag 12 januari 2016
Welcome to my blog
I am Berwout de Vries Robles, a Software Engineering student in my sixth semester. I am currently enrolled in the CPA program at Seneca College as an exchange student. My home country is the Netherlands, I will be in Canada for my minor semester after which I will return home. I will use this blog to publish course assignments and thoughts regarding Software Portability and Optimization. I hope that it will provide a thorough understanding of my work process and lead to a meaningful contribution for the large open source project we will be working on.
Abonneren op:
Posts (Atom)