Thursday, April 21, 2016

Update on phase 3

Quick update:

I got a reply to my GitHub pull request. Sadly, it was not what I had hoped for. The package maintainer replied that they want to give users full control over their compiler optimizations and not force anything on them.

Since there is, in theory, no downside to applying at least a base level of optimization, this argument seems a little strange to me.

The way I would do it is to apply a base level of optimization by default, so that users with no knowledge of compiler optimizations get the significant speedup without having to do anything. Advanced users could then opt out of the optimization if they want to. I realize that this means more work for advanced users. However, I think the situation where a user opts out of optimizations would come up very rarely, whereas the situation where a user should have applied some degree of optimization but did not, simply because they didn't know they should, is far more common.
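
As an aside, assuming the build uses GNU make, opting out would not even require editing any files: variables set on the make command line override the ones defined in the Makefile, so an advanced user could do something like the following (the flags shown are only examples):
# build with the default optimization from Makefile.def
make
# advanced user: opt out of the default and build without optimization
make CFLAGS="-O0"
# or pick different flags entirely
make CFLAGS="-O2 -g"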

Phase 3: Pushing changes upstream

To get my changes accepted upstream, the first thing I needed to figure out was where in the make/build process to insert my change, since in the phase 2 optimization testing I had only edited the Makefile directly to see the effect.

It turns out ncompress uses a Makefile.def, which contains the definitions to be used for the make call. The package is not built with ./configure like most packages; instead, after cloning you can directly run make install. This means I had to add the following two lines to Makefile.def:
#Compiler optimization flags
CFLAGS=-O3
I tested whether the package still built correctly with my change to Makefile.def, and it built as it should, with the -O3 optimization turned on.
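
A quick sanity check here, assuming GNU make and a Makefile that echoes its compile commands, is to force a rebuild and grep the output for the flag:
# force a full rebuild and check that the flag reaches the compiler
make -B 2>&1 | grep -- '-O3'
# every compile line should now contain -O3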

Next I had to figure out how to contribute to the package. Luckily, since the package is hosted on GitHub, I could simply fork it. So that is what I did: I forked the repository, made the change, and committed it to my branch. I then created the following pull request:

https://github.com/vapier/ncompress/pull/7

I've added a link to my blog and a short description to the pull request, since there is no other means of communication described on the project page. Now all we can do is wait for a response!
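
For completeness, the rough sequence of commands for the contribution itself looked something like this (the fork URL and branch name below are placeholders, not the real ones):
# clone my fork of ncompress and create a branch for the change
git clone https://github.com/<my-username>/ncompress.git
cd ncompress
git checkout -b makefile-o3-flag
# edit Makefile.def, then commit and push the change
git add Makefile.def
git commit -m "Add a default -O3 optimization flag to Makefile.def"
git push origin makefile-o3-flag
# the pull request itself is then opened from the GitHub web interface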

Sunday, April 10, 2016

Phase 2 Process

So in my previous blog post I found an optimization, but I was not too happy about it because it seemed like such a simple step to take. I guess that is sort of the point of compiler optimizations though: the more efficiently you can get things done, the better, and the compiler assists with this very handily.

I tried to get the compiler to tell me about possible optimizations as well, using -fopt-info-vec-missed to see which vectorization opportunities it missed and why. The compiler came up with about 500 lines of missed vectorizations, so I wasn't sure where to start with that. I then used -fopt-info-vec to confirm that some loops did actually get vectorized, but that got me no further either.
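
For reference, one way to generate these reports is to pass the flags through CFLAGS (either by editing the Makefile, as I did, or on the make command line); the -fopt-info output goes to the compiler's standard error, so it can be captured in a file (file names are just examples):
# list loops that could not be vectorized, and the reason why
make -B CFLAGS="-O3 -fopt-info-vec-missed" 2> vec-missed.txt
# list loops that were successfully vectorized
make -B CFLAGS="-O3 -fopt-info-vec" 2> vec.txt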

I also tried profiling the program using gprof, which showed that the program spends virtually all of its time in a single large function, whether compressing or uncompressing. This brought me no closer to finding any other optimization.
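
For anyone who wants to reproduce the profiling, the rough recipe is to build with -pg instrumentation and feed the resulting gmon.out to gprof (the input file name is just an example, and I am assuming -pg is picked up at both compile and link time):
# rebuild with profiling instrumentation
make -B CFLAGS="-O3 -pg" LDFLAGS="-pg"
# run the instrumented binary on a test file; this writes gmon.out
./compress -f big-textfile
# turn gmon.out into a readable flat profile and call graph
gprof ./compress gmon.out > profile.txt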

I have fiddled with the USERMEM and REGISTERS defines to see if they make a difference, but there seems to be a set maximum to these values somewhere, although I do not know why. To me they are sort of magic numbers right now.

All in all, the search and the constant waiting for benchmarks were frustrating, especially because I found next to nothing besides the obvious compiler flag. If I were to attempt a project like this again, I would add an extra phase and make phase 2 about writing an automatic benchmarking suite where you press a button and a reliable benchmark rolls out the other end.

I also wouldn't use my own laptop to benchmark on, because for some reason the virtual machine gives wildly varying performance and I can never tell whether a measurement is an accurate representation of a change or not. Luckily betty had my sanity covered there.
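
To give an idea of what I mean, a minimal sketch of such a harness could look like the script below (file name and run count are just examples). It repeats the compress/uncompress cycle several times so outliers from background noise stand out, and it checks correctness with md5 on every run:
#!/bin/bash
# Repeat the compress/uncompress cycle a few times and verify correctness,
# so a single noisy run is easy to spot and discard.
RUNS=5
FILE=big-textfile                      # example input file
ORIG_SUM=$(md5sum < "$FILE")
for i in $(seq 1 "$RUNS"); do
    cp "$FILE" work
    echo "== run $i: compress =="
    time ./compress -f work            # produces work.Z, removes work
    echo "== run $i: uncompress =="
    time ./compress -d -f work.Z       # restores work, removes work.Z
    # the round trip must not change the data
    [ "$(md5sum < work)" = "$ORIG_SUM" ] || echo "run $i: checksum mismatch!"
done
rm -f work work.Z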

Project Phase 2

After last week, where we ended our project selection with a somewhat hasty switch to another project, I am happy to be able to just dig into this package a little. Sadly I have been sick for half of this week (and still am), so progress is slower than I would have hoped.

In my project selection blog post I mentioned that I wanted to look at the vectorization the compiler can do for us if we make some small changes to the code. However, before looking into the code I took a look at the Makefile to see whether any vectorization flags are currently turned on.

This is what the compiler options line in the Makefile looks like on both betty and my own x86_64 machine:
options= $(CFLAGS) -DNOFUNCDEF -DUTIME_H $(LDFLAGS) $(CFLAGS) $(CPPFLAGS) -DDIRENT=1 -DUSERMEM=800000 -DREGISTERS=3
What I found out is that CFLAGS, CPPFLAGS, and LDFLAGS are all undefined in this case. What I also found strange is that these options hard-code a fixed amount of memory for the program to use and a fixed number of registers it may use.

The most glaring omission here is that not even a base level of optimization is applied, so I set out to test that first. To verify that the optimization does not change the program's behavior, we use md5 checksums to check that our compressed and uncompressed files remain identical. After slowly ramping up the optimization level, it turns out that the program works at -O3 without causing any problems.
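
Concretely, that check looks something like this (the file name is just an example):
md5sum testfile                  # checksum of the original file
./compress -f testfile           # produces testfile.Z and removes testfile
./compress -d -f testfile.Z      # restores testfile from testfile.Z
md5sum testfile                  # must print the same checksum as before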

During benchmarking I first encountered a couple of runs where uncompressing the files looked slower than before applying the optimization. However, after running some more benchmarks, it seems that this was probably background noise from other processes running on my system.

After the -O3 optimization level had been applied we see the following results in the benchmarks:
aarch64 test server betty:
compress:
2545934560 bytes (~2.4 GB) text file:
real:    64.467s
user:   63.280s
sys:     1.180s

1073741824 bytes (1 GB) integer stream:
real:    9.413s
user:   8.980s
sys:     0.430s


uncompress:
2545934560 bytes (~2.4 GB) text file:
real:    10.882s
user:   9.510s
sys:     1.370s

1073741824 bytes (1 GB) integer stream:
real:    4.734s
user:   3.920s
sys:     0.700s

my own x86_64 Fedora 23 installation:
compress: 
2545934560 bytes (~2.4 GB) text file:
real:    34.812s
user:   18.821s
sys:     4.715s

1073741824 bytes (1 GB) integer stream:
real:    13.274s
user:   3.789s
sys:     1.651s

uncompress: 
2545934560 bytes (~2.4 GB) text file:
real:    45.669s
user:   5.779s
sys:     2.339s

1073741824 bytes (1 GB) integer stream:
real:    17.713s
user:   2.814s
sys:     1.107s


If we compare that with last week's results, we find that in all cases compression was faster with the optimization turned on. Oddly enough, however, on my local machine uncompress was notably slower! This could be explained by disturbances from background processes on my machine, since I am testing inside a virtual machine. However, I have tested extensively and have yet to see the optimized version of uncompress not do worse.
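
For what it's worth, one way to rule out disk caching as a factor in the uncompress numbers would be to drop the Linux page cache between runs, so that every run reads the file from disk in the same way (this needs root and is Linux-specific; the file name is an example):
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page, dentry and inode caches
time ./compress -d -f testfile.Z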

To implement this optimization in the actual package I will have to add it to the Makefile recipe; for the testing I have just been editing the Makefile itself. I have also been looking for a reason why compression on the aarch64 server betty is so terribly slow while uncompress is very fast, but I have yet to find the answer.