Monday, March 21, 2016

SPO600 Compiler Optimizations

As we are starting projects shortly, this week there is no lab post. Instead we take a look at some gcc optimization flags. Everyone in the course was tasked with selecting two gcc compiler optimization flags and figuring out what they do.

My first choice is:
-floop-strip-mine

DISCLAIMER:
We cannot actually run the -floop-strip-mine flag right now, because it is part of gcc's Graphite loop optimization framework, which requires extra libraries (ISL/CLooG) to be present when gcc itself is built. That support is missing from most readily available gcc builds (including the one you probably have installed and the one I have installed). I have tried building gcc myself, but the process is fairly complicated and I could not get it to work. This means that in this post I will show a manual example of what some sample code looks like before and after the transformation, but there will be no assembly demonstrating the power of the actual flag.

The f before the option name stands for flag; you turn one on by adding it as an argument to a compilation, for example:
gcc -floop-strip-mine program.c
Some optimizations are always on, and some options, such as -O2 and -O3, turn on entire groups of optimization flags. To turn a specific flag off, replace the -f prefix with -fno-, for example:

gcc -fno-loop-strip-mine program.c
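As an aside, if you want to see exactly which flags a given optimization level turns on, gcc can list them for you:
gcc -Q -O3 --help=optimizers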
The reason I chose -floop-strip-mine is that our projects could potentially benefit a lot from multi-threading if they do not take advantage of it yet, and -floop-strip-mine is an optimization that is especially beneficial in multi-threaded situations.

-floop-strip-mine turns on strip mining of loops. To see how that works, here is a pseudocode example:

Let's say we have two two-dimensional arrays, a and b, that both have 8 rows and 8 columns.
Let us assume they are filled with values; what those values are doesn't really matter right now. We are going to loop through the arrays and do a multiplication, storing the result in a two-dimensional array c.
int a[8][8], b[8][8], c[8][8];

for(int i=0; i<8; i++){
    for(int j=0;j<8;j++){
        c[i][j] = a[i][j] * b[j][i];
    }
}

The values in a and b are stored in main memory. Let's see what happens when we pull them out of main memory into the cache, assuming our cache lines can store 4 integer values at a time.

The first iteration:
i = 0;
j = 0;
elements we need: a[0][0], b[0][0].
lines going from main memory into cache:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

The second iteration:
i = 0;
j = 1;
elements we need: a[0][1], b[1][0]
Hey that is nice! a[0][1] is already in cache so we can reuse that cache line!
lines going from main memory into cache:
line 3: b[1][0], b[1][1], b[1][2], b[1][3]
Sadly we had to pull in a new line for b, as the value was not in cache yet.

The third iteration:
i = 0;
j = 2;
elements we need: a[0][2], b[2][0]
As in the second iteration the a-line is already in cache, so we can reuse it, the b line however we have to pull in:
line 4: b[2][0], b[2][1], b[2][2], b[2][3]

The fourth iteration:
i=0;
j=3;
elements we need: a[0][3], b[3][0]
As in previous iterations, a is in cache and we pull in:
line 5: b[3][0], b[3][1], b[3][2], b[3][3]

The fifth iteration:
i=0;
j=4;
elements we need: a[0][4], b[4][0]
Now we need to also bring in a new line for a, note that we have been using the same line in all the previous iterations though:
line 6: a[0][4], a[0][5], a[0][6], a[0][7]
line 7: b[4][0], b[4][1], b[4][2], b[4][3]

We can sort of see a pattern here: a is nicely lined up in cache; if we proceed past the 4th iteration it will still only take up 2 cache lines. The b-array, however, is accessed against its memory layout: every new iteration we need to bring a new cache line in from main memory! Note that this is still not a problem in a realistic situation where we have more than enough cache lines to store all the values of b.

To illustrate what strip mining is and why it works let us assume we only have 8 cache lines to store the values of a and b in. Whenever a new cache line is brought in that does not fit in the cache, the oldest unused cache line gets thrown out.

In the next few iterations I am hoping to make clear what goes wrong here:
The sixth iteration:
i=0;
j=5;
elements we need a[0][5], b[5][0]
cache lines to pull in:
line 8: b[5][0], b[5][1], b[5][2], b[5][3]
Note that our cache is now full! We will start replacing the oldest unused line in the next iterations!

The seventh iteration:
i=0;
j=6;
elements we need a[0][6], b[6][0]
We find the oldest unused cache line, which is line 2 (line 1 was still in use up to the fourth iteration), and replace it:
replace line 2: b[6][0], b[6][1], b[6][2], b[6][3]

The eighth iteration:
i=0;
j=7;
elements we need: a[0][7], b[7][0]
We find the oldest unused cache line, which is line 3 and replace it:
replace line 3: b[7][0], b[7][1], b[7][2], b[7][3]

This was the final iteration with i=0. So far we have not encountered any real problems besides having to replace some cache lines, which is not a big deal yet. Now we look at the first few iterations where i=1.

The ninth iteration:
i=1;
j=0;
elements we need: a[1][0], b[0][1]
First we find the oldest unused cache line, which is line 4, and replace it; then we find the second oldest unused cache line, which is line 1, and replace that too:
replace line 4: a[1][0], a[1][1], a[1][2], a[1][3]
replace line 1: b[0][0], b[0][1], b[0][2], b[0][3]
Note here that we used to have the b-value b[0][1] in cache already, but we had to replace it with another value because our cache was full! This means we had to pull the same value from memory into cache twice!

The tenth iteration:
i=1;
j=1;
elements we need: a[1][1], b[1][1]
Again we see the recurring pattern: we already have a[1][1] in cache in line 4, but we have to replace a cache line to get the b-value in. We had b[1][1] in cache back in the second iteration, but we were forced to replace that line because our cache was full!

The pattern we are seeing here is that because the b-array is accessed as b[j][i] instead of b[i][j], we are running into trouble fitting the values into cache properly, and we keep creating cache misses by replacing lines whose values we still need later on!

Normally, in a situation like this where the reversed access b[j][i] is causing the cache misses, you would apply what is called loop interchange: you reverse the order of the loops so that b[j][i] is effectively traversed like b[i][j]. However, we cannot do that in this case, because a[i][j] is in the same loop! Reversing the loops would create the exact same problem on the other side: after swapping the loops around we would have a lot of cache misses on the a-array!
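For comparison, this is what loop interchange would look like here (a quick illustration of the technique, not an actual gcc transformation):

for(int j=0; j<8; j++){
    for(int i=0; i<8; i++){
        c[i][j] = a[i][j] * b[j][i]; /* b now walks along its rows, but a walks down its columns */
    }
}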

This is where strip mining comes in!
We take the code we had written for our original program and turn it into this:
int a[8][8], b[8][8], c[8][8];

for(int x=0; x<8; x+=4){
    for(int y=0; y<8; y+=4){
        for(int i=x; i<min(x+4, 8); i++){
            for(int j=y; j<min(y+4, 8); j++){
                c[i][j] = a[i][j]*b[j][i];
            }
        }
    }
}
Now I admit that if you look at this for the first time, it does not make a lot of programmatic sense: you are introducing two extra for loops and the loops look strange! (min here is pseudocode for "the smaller of the two values"; in real C you would write it as a macro or helper function.) From a cache viewpoint, however, this makes a lot of sense.

Here is what the iterations in this loop look like:
1st:
x=0, y=0, i=0; j=0
cache lines:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

2nd:
x=0, y=0, i=0; j=1
cache lines:
line 3: b[1][0],b[1][1],b[1][2],b[1][3]

3rd:
x=0, y=0, i=0; j=2
cache lines:
line 4: b[2][0],b[2][1],b[2][2],b[2][3]

4th:
x=0, y=0, i=0; j=3
cache lines:
line 5: b[3][0],b[3][1],b[3][2],b[3][3]

5th:
j is no longer smaller than 4 (y+4) so instead we go up one loop and increase i!
x=0, y=0, i=1, j=0
cache lines:
line 6: a[1][0], a[1][1], a[1][2], a[1][3]
Note that here we still have the required b cache line!

6th:
x=0, y=0, i=1,j=1
cache lines:
No cache lines required! We have line 6 which contains a[1][1] and line 3 which contains b[1][1]!

7th:
x=0, y=0, i=1,j=2
cache lines:
No cache lines required! We have line 6 which contains a[1][2] and line 4 which contains b[2][1]!

Actually, if you keep iterating the loop, you can see that the first 16 iterations need only 8 cache lines in total! The lines we are still missing are a[2][0], a[2][1], a[2][2], a[2][3] and a[3][0], a[3][1], a[3][2], a[3][3], but that is it for the first 16 iterations! In fact, every group of 16 iterations in the cycle requires only 8 cache lines to store its particular block!

If we look at the arrays visually, we are structuring the loops so they process these blocks of 4 by 4 one at a time before moving on to the next block! The reason for this is that these blocks happen to fit perfectly in our limited cache size.
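Roughly like this (a quick text rendering of the idea):

            columns 0-3     columns 4-7
          +---------------+---------------+
rows 0-3  |    block 1    |    block 2    |
          +---------------+---------------+
rows 4-7  |    block 3    |    block 4    |
          +---------------+---------------+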



In the real world we obviously have larger cache lines and more of them, but the concept applies in exactly the same way to arrays much larger than this one. This means that if you are working with very large two-dimensional arrays and traversing them in a sub-optimal manner, strip mining may be the solution for you!

Another time when strip-mining can be very valuable is if you are dealing with a multi-threading situation. Often when multi-threading, you will divide the data up into chunks that will be given to a specific processor. All the processors then run the same program, but on different pieces of data. This is a common way of managing workloads between multiple processors.

A problem that arises from this is cache coherency between processors. If one processor has a cache line containing a value that is also in another processor's cache, and it updates a value on that line, the whole line needs to be updated in the second processor, even though the second processor may not even need the specific value that was updated.

So say processor 1 is working on a[0][1] and a[0][2], and processor 2 is working on a[0][3] and a[0][4]. Because we can only pull in entire cache lines at a time, the cache line processor 1 is working on accidentally also contains a[0][3] and a[0][4]. This means that when processor 2 updates a[0][3], processor 1 requires an update, even though it is not even actively working on a[0][3]! It just happened to be on the same cache line. There is a lot of unnecessary communication between processors this way.

To prevent this unwarranted communication we strip-mine the loop and instead deal out the blocks of 16 values to different processors. This way there is no overlap in cache lines and each processor stays out of the other processors' way. If you wish to know more about this very common problem in multi-processing, search for "false sharing".
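To make that concrete, here is a minimal sketch of dealing the blocks out to threads. The use of OpenMP is my own choice of illustration, not something the flag does for you (compile with gcc -fopenmp):

#define N 8
#define BLOCK 4
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Each 4x4 block is an independent unit of work, so the two block loops
   can be handed out to threads; no two threads touch the same cache lines. */
void multiplyBlocks(int a[N][N], int b[N][N], int c[N][N])
{
    #pragma omp parallel for collapse(2)
    for (int x = 0; x < N; x += BLOCK)
        for (int y = 0; y < N; y += BLOCK)
            for (int i = x; i < MIN(x + BLOCK, N); i++)
                for (int j = y; j < MIN(y + BLOCK, N); j++)
                    c[i][j] = a[i][j] * b[j][i];
}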

So to recap:
We strip-mine because:
a. If our cache size is small, it prevents a lot of cache-misses.
b. If we are multi-threading, it prevents a lot of communication overhead caused by false sharing.

I am a little bit sad that I could not present you with a view of what gcc does with our code in terms of assembly. I will look into obtaining a gcc build with the necessary libraries and post some more on this if I find one.

Sunday, February 28, 2016

SPO600 Lab 7

In this lab we are going to look at a software package that uses inline assembler: we will look at why it uses it, whether that use is correct, and what platforms it is present for.

I will be looking at libmad, or MAD, which is an MPEG audio decoding package. If you wish to follow along, the source code package is available on SourceForge at https://sourceforge.net/projects/mad/files/libmad/; as demonstrated in Lab 1, we can extract the tarball using the tar -xvf [filename] command. We don't have to build the program to look at the source code. First off I used egrep -R "asm | __(asm)__*\(" to recursively search the directory tree for the files inside the package that contain assembler.
This gives us 3 results, 2 of which are actual inline uses of assembler:
intl/libgnuintl.h.in and audio_jaguar.c

First I will be looking at intl/libgnuintl.h.in to see if I can figure out what the instruction is there for. The definition we are looking at is:
#ifdef _INTL_REDIRECT_ASM
# define _INTL_ASM(cname) __asm__ (_INTL_ASMNAME (__USER_LABEL_PREFIX__, #cname))
According to the comments, this sets up an alternative name for the usual function. The reason it is here is that, due to duplicate names across different libraries, glibc 2.0 and newer as well as Solaris 2.4 override a function used in the libintl library. I can only assume that this is a valid reason to rename it.

On to audio_jaguar.c, where it concerns more than just the renaming of a function; here we find the following code:
/*
 * NAME:        float32()
 * DESCRIPTION: store 32-bit IEEE Standard 754 floating point representation
 */
static
void float32(void **ptr, mad_fixed_t sample)
{
  unsigned long **mem = (void *) ptr, ieee = 0;

  if (sample) {
    unsigned int zc;
    unsigned long abs, s, e, f;

    enum {
      BIAS = 0x7f
    };

    /* |seeeeeee|efffffff|ffffffff|ffffffff| */

    s = sample & 0x80000000UL;
    abs = s ? -sample : sample;

    /* PPC count leading zeros */
    asm ("cntlzw %0,%1" : "=r" (zc) : "r" (abs));

    /* calculate exponent */

    e = (((32 - MAD_F_FRACBITS - 1) - zc + BIAS) << 23) & 0x7f800000UL;

    /* normalize 1.f - round? */

    f = ((abs << zc) >> 8) & 0x007fffffUL;

    ieee = s | e | f;
  }
  /* ... rest of the function omitted in this excerpt ... */
Here the assembly instruction is used to count the leading zeros in a register, as part of converting MAD's fixed-point sample format (mad_fixed_t) into the 32-bit IEEE 754 floating-point representation. Note that the code offers no alternative and is only valid on PowerPC. I think this is because most modern processors support IEEE 754 floating point natively, so they do not need a conversion function like this one. In my opinion it is a little bit sloppy to let the function break if run on any other processor; interestingly enough, this code was written to increase the portability of the library: native support was absent on this specific platform, so an alternative implementation was written for it.
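For the curious, a portable C fallback for cntlzw could look something like this (my own sketch, not code from libmad):

#include <stdint.h>

/* Count the leading zero bits of a 32-bit value; like cntlzw,
   returns 32 for an input of 0. */
static unsigned int count_leading_zeros(uint32_t x)
{
    unsigned int zc = 0;

    if (x == 0)
        return 32;
    while (!(x & 0x80000000UL)) {
        x <<= 1;
        zc++;
    }
    return zc;
}

gcc also offers the __builtin_clz builtin, which maps to a native count-leading-zeros instruction where one exists (though it is undefined for an input of 0).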

That was it for this library; there was not that much assembly to be found in here. The assembly that was there was written to increase portability to systems that have naming issues or lack native IEEE 754 support. I would have liked to see a C alternative that keeps the latter function from breaking on platforms other than PowerPC, but it is probably assumed that any modern system supports the IEEE 754 standard.


SPO600 Lab 6

For this lab we are writing a simple program to see if we can get the gcc compiler's auto-vectorization to kick in. We have seen in class that auto-vectorization does not kick in at optimization levels lower than three. The program fills two 1000-element arrays of 16-bit integers with random numbers, then sums them element-wise into a third array. Lastly it adds up all the numbers in the third array into a long integer.

Our code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>

int main() {
    srand(time(NULL));
    int arraySize = 1000;
    int16_t* a = (int16_t*)malloc(sizeof(int16_t) * arraySize);
    int16_t* b = (int16_t*)malloc(sizeof(int16_t) * arraySize);
    int16_t* c = (int16_t*)malloc(sizeof(int16_t) * arraySize);
    for (int i = 0; i < arraySize; i++) {
        a[i] = rand();
        b[i] = rand();
    }
    for (int j = 0; j < arraySize; j++) {
        c[j] = a[j] + b[j];
    }
    long sum = 0;
    for (int k = 0; k < arraySize; k++) {
        sum += c[k];
    }
    printf("%d\n", sum);
    free(a);
    free(b);
    free(c);
    return 0;
}
A couple of notes about this code:
I realize the three loops could have been one loop, but the specific wording of the assignment suggested the operations had to be done in sequential order, so I implemented it this way on purpose. I suspect it may also be easier for the compiler to recognize as vectorizable this way, but I will have to test that some more later.
Note that the assignment did not ask for printing the result, but I was worried the compiler might eliminate my code if I didn't do anything with the result, and it is nice to have some confirmation that the program worked.

We will compile this code on the aarch64 system we used in the previous lab, called aarchie, in order to have a look at some aarch64 vectorization assembly language.

The compiler command we run is:
gcc -x c vectorization.C -o vectorization -O3 -std=c99 -g
The -std=c99 option is necessary on aarchie because it has gcc version 4.8.3, where C89 is still the default. -O3 turns on the optimizations we are looking for. The -x c specifies that we want the file compiled as C: gcc was assuming I wanted C++ because of the capital .C file extension (which gcc treats as C++), and this solved that problem. -o specifies the output filename, and -g includes debug information.
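(As an aside, gcc can also report which loops it vectorized. On a gcc of this vintage the reporting flag should be -ftree-vectorizer-verbose=2, added to the command above; I have not verified the exact output on aarchie, so take that as a pointer rather than a recipe.)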

I am going to look at the binary using the objdump -d command. Note that because we specified -O3 we gave the compiler permission to toss our code around, making the assembly substantially harder to read. Because the code is so garbled, I am going to pick out some excerpts showing that auto-vectorization kicked in.

The first vectorization I found is:
  40071c:       4cdf7441        ld1     {v1.8h}, [x2], #16
  400720:       4cdf7460        ld1     {v0.8h}, [x3], #16

Here we are loading part of our first and second arrays at the top, using a post-indexed immediate offset of 16 to jump past the section we just did in the next iteration (loading and incrementing the pointer in one instruction).
  400724:       4e608420        add     v0.8h, v1.8h, v0.8h

We then call the add on the vector registers, storing the result in vector register 0.
  400728:       11000400        add     w0, w0, #0x1
This add instruction is for the loop control variable. The compiler chose to do this before storing the result so the processor gets some time to finish this operation before the compare.
  40072c:       4c9f7420        st1     {v0.8h}, [x1], #16
Here we are storing our result into the memory location pointed to by x1, again using a post-indexed immediate offset.
  400730:       6b04001f        cmp     w0, w4
  400734:       54ffff43        b.cc    40071c <main+0x13c>
Then comes the compare, and the branch back to the top if we aren't finished yet. Note that the optional post-indexed immediate offsets save us a lot of instructions otherwise spent on pointer arithmetic here.

The second part we expected to be vectorized was the summation of the third array, that is found in this section:
 4008b8:       4cdf7420        ld1     {v0.8h}, [x1], #16
Here we load the array into vector register 0; note it is still at the address pointed to by x1, where it was stored in the previous section, and we are still using the post-indexed immediate offset.
  4008bc:       0f10a401        sxtl    v1.4s, v0.4h
  4008c0:       0f20a423        sxtl    v3.2d, v1.2s
  4008c4:       4f10a400        sxtl2   v0.4s, v0.8h
  4008c8:       4f20a421        sxtl2   v1.2d, v1.4s
  4008cc:       4ee28462        add     v2.2d, v3.2d, v2.2d
  4008d0:       4ee28422        add     v2.2d, v1.2d, v2.2d
  4008d4:       11000400        add     w0, w0, #0x1
  4008d8:       0f20a401        sxtl    v1.2d, v0.2s
  4008dc:       4ee28422        add     v2.2d, v1.2d, v2.2d
  4008e0:       4f20a400        sxtl2   v0.2d, v0.4s
  4008e4:       6b00007f        cmp     w3, w0
  4008e8:       4ee28402        add     v2.2d, v0.2d, v2.2d
  4008ec:       54fffe68        b.hi    4008b8 <main+0x2d8>
Above we see a somewhat hard-to-read mixture of addition and lengthening instructions. The add and compare instructions on the regular registers handle the loop control variable. I think the addition and lengthening instructions are so intermixed because the compiler reuses the original vector register (v0), which already contains the right values, as much as possible. This way the originally loaded values don't have to be moved around as much.

Something interesting to note here is that at first I was not sure why the compiler was lengthening the values to 64-bit signed integers; the answer is that on a 64-bit Linux system a long is 64 bits wide, not 32. The compiler is also not making use of the vector reduce operation, which would seem the logical operation to use in this situation; that may be explained by the choice for 64-bit integers, because reducing 2 values is probably less efficient than vector addition. Furthermore, the compiler did all this vectorization without us providing any hints so far; I will try to see if I can improve the second part with some hints in the code later on. I am glad that it was reasonably easy to find the vector instructions even at the -O3 optimization level.

Lastly we were asked to look for a way to vectorize our sound volume program from the last lab. Vector operations cannot mix data types, so we have to think of a way to bring our 16-bit integer samples and our 32-bit float scaling factor together.

In the ARM overview I found the vectorized lengthening instructions:
SXTL Vd.<Td>, Vn.<Ts>
SXTL2 Vd.<Td>, Vn.<Ts>
These promote our 16-bit signed integer samples to 32-bit signed integers. We can then convert them to 32-bit floats with:
SCVTF Vd.<T>, Vn.<T>
Because the data are now the same type and length as the scaling factor, we can use the following instruction to multiply the two:
FMUL Vd.<T>, Vn.<T>, Vm.<Ts>[index]
Finally, the conversion back to signed integers can be done with:
FCVTZS Vd.<T>, Vn.<T>
after which SQXTN/SQXTN2 can narrow the results back down to 16 bits.
I believe these instructions should lead to a valid vectorized sound scaling program. Whether it is the most optimal solution I am not sure, and I will have to look at this some more later.
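Out of curiosity I also sketched the same idea with NEON intrinsics in C, which map onto the instructions above (my own untested sketch, not part of the lab; the instruction each line compiles to is noted in the comments):

#include <arm_neon.h>
#include <stdint.h>

void volumeScaleVector(int16_t* samples, int n, float scale)
{
    for (int i = 0; i + 8 <= n; i += 8) {
        int16x8_t   s  = vld1q_s16(samples + i);            /* ld1    */
        int32x4_t   lo = vmovl_s16(vget_low_s16(s));        /* sxtl   */
        int32x4_t   hi = vmovl_s16(vget_high_s16(s));       /* sxtl2  */
        float32x4_t fl = vcvtq_f32_s32(lo);                 /* scvtf  */
        float32x4_t fh = vcvtq_f32_s32(hi);                 /* scvtf  */
        fl = vmulq_n_f32(fl, scale);                        /* fmul   */
        fh = vmulq_n_f32(fh, scale);                        /* fmul   */
        lo = vcvtq_s32_f32(fl);                             /* fcvtzs */
        hi = vcvtq_s32_f32(fh);                             /* fcvtzs */
        s  = vcombine_s16(vqmovn_s32(lo), vqmovn_s32(hi));  /* sqxtn  */
        vst1q_s16(samples + i, s);                          /* st1    */
    }
    /* a scalar loop would handle any remaining samples (n % 8) */
}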

That's it for now! I will revisit some of this later due to multiple things I am not 100% sure about at the moment.

Monday, February 22, 2016

SPO600 Lab 5

For this lab we will be looking at 2 different approaches to scaling 16-bit signed integer sound samples by a float to simulate different volumes. We will also be testing these approaches on 2 different machine architectures: one aarch64 machine and one x86_64 machine.

The first approach is to naively scale each value and return the result. When scaling n samples this results in n multiplication instructions.

The second approach is to create a lookup table where we store the scaled result of every possible sample at its index. We populate the lookup table whenever the user changes the scaling factor, but during consecutive calls with an unchanged factor we can simply look the value up in the table. With a 16-bit signed integer this results in 65536 multiplication instructions per volume change. The cache will probably be too small to hold the entire table, so this approach will be more memory intensive.

To achieve a clear difference we will be attempting to scale 500 million sound samples, which is roughly equal to 1 hour and 35 minutes of samples in 44.1 kHz stereo sound.

To estimate the time used by the scaling functions we will be using a control program that does the same memory allocation and basic operations as the program would do without the scaling functions. We will be timing these three programs to see which one does best.

The control program looks like this:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>

void fillSampleArray(int16_t* arr, int numOfElems);

int main()
{
    //creating an array of samples
    srand((int)time(NULL));
    int numberOfElements = 500000000;
    int16_t* sampleArray = (int16_t*)malloc(sizeof(int16_t) * numberOfElements);
    fillSampleArray(sampleArray, numberOfElements);

    int result = INT16_MIN;

    for (int i = 0; i < numberOfElements; i++) {
        result = result ^  sampleArray[i];
    }

    printf("%d", result);
    free(sampleArray);
    return 0;
}


void fillSampleArray(int16_t* arr, int numOfElems)
{
    for (int i = 0; i < numOfElems; i++)
    {
        arr[i] = ((rand() * 2) % (UINT16_MAX + 1)) + (INT16_MIN);
    }
}
Some things to note about this code:
The reason we are using the ^ operator and printing a result is to make sure the compiler does not strip our code thinking that it is useless. We force it to use the code in some form of output so it can't get rid of it.

The reason we are multiplying rand() by 2 is that rand() is only guaranteed a maximum value (RAND_MAX) of at least 32767, while we need to represent around 65000 numbers, so this is our way of artificially doing so. Note that this creates only even numbers and still only around 32000 distinct values. For our scaling function this does not matter, as we just care about having a lot of samples.

For the 2 implementations we add this function prototype above:
int16_t volumeScale(int16_t sample, float scale);
We make a small change to the main program to use the function; the result is now calculated as follows:
result = result ^ volumeScale(sampleArray[i], 0.750);
Our naive solution is fairly simple and looks as you would expect:
int16_t volumeScale(int16_t sample, float scale)
{
    return (int16_t)(sample * scale);
}
The lookup function is a bit more complicated due to the need to save state across multiple calls and to allocate the array that stores the precomputed results.
int16_t lookUpVolumeScale(int16_t sample, float scale)
{
    //previous volume
    static float pVolume = -1;

    //lookup table
    static int16_t* luTable=NULL;

    if(luTable==NULL){
        luTable = (int16_t*)malloc(sizeof(int16_t) * (UINT16_MAX + 1));
    }

    if(pVolume != scale)
    {
        for(int i=0; i<UINT16_MAX+1; i++)
        {
            //INT16_MIN is the offset
            luTable[i] = scale*(i+INT16_MIN);    
        }
        pVolume = scale;
    }
    return luTable[sample + INT16_MAX + 1];
}
Funny thing here: I am a C# / Java programmer who has never programmed anything in C besides a flashing LED, so I did not actually know how to save state inside a function across multiple calls. It took me 2 hours of googling to figure out that the static keyword does all this for you. Sometimes when you are searching for something considered to be very simple for a C programmer, it can actually be very hard to find a result that explains it.

Lastly we run a comparison of the three programs on both our platforms, using the time command and the real duration in seconds:

Platform             Control    Naive      Lookup
x86_64 (Xerxes)      10.728     12.354     19.238
aarch64 (Aarchie)    181.366    286.152    223.343

Some things to note here:
  • We have not changed the volume; this means we are assuming our user is going to listen to 1.5 hours of music without ever changing the volume.
  • The aarch64 system is compiling with a slightly older version of gcc. 
  • The time command is probably not the best approach to accurately time these programs (see the sketch below for a more precise alternative).
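For a future rerun, a more precise measurement could time just the scaling loop with the POSIX monotonic clock. This is my own sketch, not something the lab used (on older glibc you may need to link with -lrt):

#include <stdio.h>
#include <time.h>

/* Returns the elapsed time between two clock_gettime() readings in seconds. */
double secondsBetween(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Usage around the scaling loop:
     struct timespec t0, t1;
     clock_gettime(CLOCK_MONOTONIC, &t0);
     for (int i = 0; i < numberOfElements; i++)
         result = result ^ volumeScale(sampleArray[i], 0.750);
     clock_gettime(CLOCK_MONOTONIC, &t1);
     printf("scaling took %f seconds\n", secondsBetween(t0, t1));
*/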

With these things in mind I think we can still say a few things about the results, given the size of the gaps between the solutions. Even without changing the volume, the naive scaling outperforms the lookup table on x86_64; this probably means memory access speed is the bottleneck there. On the aarch64 system, however, the lookup table performs much better than the naive scaling. There is a full minute of difference, and even if you account for us never changing the volume, this is a significant gap. This suggests that processing power is the bottleneck on that system.
In both cases we could easily scale the amount of sound we had in a fraction of the time required to listen to it, so there would be no problem with either approach. However, if you are planning to scale a huge number of sound samples, a test like this is definitely worth the effort.

Next lab we will be looking at parallelising some of these operations to see if we can achieve an even faster result!

Sunday, January 31, 2016

SPO600 Lab 4

In our fourth lab we looked at some of the different compiler options and the differences they made in the assembly output. In order to look at the differences we wrote a simple hello world program and compiled it with the different options.

I will go by the options we went over one by one below.
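The post never included the test program itself, but it was a standard hello world along these lines:

#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}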

0. Original.
The original command we used was gcc -g -O0 -fno-builtin.
This compiles the file with debugging information, no optimization, and without the use of builtin function replacements.
 
1. Static.
Adding -static means the compiler pulls the entire library containing the functions you use into the executable. In this case it pulled in the library that contains the printf function we used to print our message, which resulted in roughly a 10-times increase in file size. The advantage of static linking is that no matter what system you go to, the functions you are using are guaranteed to be present, since you include the entire library. The disadvantages are the strain on file size, and that if you use static linking a lot, many of your programs will contain the same libraries in different versions, making it hard to keep track of security flaws and versioning.

2. Not using -fno-builtin.
-fno-builtin asks the compiler not to use its builtin function replacements. With the option removed, the compiler recognizes that it has a more efficient way to handle this printf call because it only uses one argument: it substitutes the builtin function puts.

3. Removing -g.
By removing -g from the command we compile the file without debug information. This makes the file smaller, but also harder to read: objdump --source no longer shows you where the compiler put which part of your code.

4. Adding additional arguments to the printf.
On aarch64, after 7 arguments (plus the format string) are assigned to registers, the rest of the arguments get put on the stack. On x86_64 only 5 arguments were assigned to registers before the rest went on the stack. This makes sense because x86_64 has fewer argument registers to work with. Interestingly enough, on the aarch64 side the assembly made use of store (str) instructions instead of push to place the arguments on the stack.

5. Moving printf to a separate function called output and calling that from the main.
With this one I thought the compiler would recognise that output just calls printf and has no real use, but I was wrong. The assembly called output from the main, and inside output was the same section that was previously in the main. If you use arguments to pass the string, some extra overhead is also stored on the stack.

6. Replacing -O0 with -O3.
This asks the compiler to optimize aggressively; it will actually move some of your code around to do so, so if you compile something with -O3 you always have to check whether the functionality of your program remains the same. One of the differences in the output is that instead of returning to the main function after calling printf, the compiler realized that printf was the last statement and used an unconditional branch (a tail call) to tell the program not to bother coming back to the main. The aggressive optimization also deleted the stack set-up in this case, because this program makes no use of it.

So this week we saw a small part of what the compiler can do for us and what we can ask it to do. I found it very interesting to finally see what makes it so that any code we write can actually run. I think understanding how the compiler turns our code into assembler can really help us decide what options to use when we want to go for maximum efficiency in our code.

Wednesday, January 27, 2016

SPO600 Lab 3


For our third lab we split up into groups to write some assembly. We started by looking at some x86_64 assembler, after which we looked at the differences between that and aarch64 assembler.

Our first assignment was to build a simple loop that counts from 0-9 and displays "Loop: [the current number]\n" on each pass. In order to convert our loop number to an ASCII character we have to add 48 to the number, because that is the offset at which the digit characters start in ASCII. We did this by moving 48 into an empty register that doesn't get trampled at the start of the program. Then, in the loop, we moved the loop number into an empty register, added the 48 from the previous register, and wrote the result into the message whose address is in the rsi register (the message-location register for the write syscall).
At first we had our string in a .rodata section, which stands for read-only data, so when we tried to write into it, nothing happened. We fixed this by removing the "ro" to spell .data.
Next, we did not realise that mov moves a full 64 bits (8 bytes), even if you only have 1 byte of data.
This caused our number's move to overwrite the newline character at the end of the original string. This is fixed by adding the b suffix to mov and using the byte-sized name of the register you wish to move from. So it becomes something like: movb %r13b,(msg+6).
On aarch64 moving into the message with an offset was a little bit different, but otherwise the instructions were similar.

Next we changed it so the loop would print from 00-30. This used the div instruction, which divides the value in rax (together with rdx, which holds the upper bits of the dividend) by a chosen register. It stores the quotient in rax and the remainder in rdx. We divided the loop number by 10, added our offset of 48 to rax and rdx, and moved them into the string as we had done in the previous assignment.
One thing to pay attention to: because rdx holds the upper bits of the dividend, it has to be set to 0 before using the div operation!
On aarch64 this was similar, but you had to use msub to calculate the remainder.
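For reference, here is the digit-splitting logic both versions implement, expressed in C (my own illustration, not part of the lab):

/* Splits a number in 0-99 into two ASCII digit characters. */
void splitDigits(int number, char *tens, char *ones)
{
    *tens = '0' + (number / 10); /* div leaves the quotient in rax */
    *ones = '0' + (number % 10); /* ...and the remainder in rdx    */
}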

The last assignment was suppressing the first digit when the number was below 10. We did this by checking whether the quotient was zero; if it was, we used je (jump if equal) on x86_64 or b.eq (branch if equal) on aarch64 to jump to a label past the mov instruction that writes the first digit into the message.

Tips and tricks:
as -g -o filename.o filename.s; ld -o filename filename.o; ./filename compiles and executes a source file in one statement.
Make sure you know whether you are on a least-significant-byte-first (little-endian) or most-significant-byte-first (big-endian) architecture, as it can heavily influence the result of moving a specific byte somewhere.
Copy your files regularly so if you add to a file and you break it, you still have a working copy to try something new on.

Observations about the differences between aarch64 and x86_64 assembler:
- If you need just the remainder on aarch64 you still have to divide and then compute it with msub, whereas you get the remainder built in on x86_64.
- When moving just a byte on aarch64 you specify a 32 bit-wide register and you work with an offset, whereas on x86_64 you have the option to specify byte sized.

Thursday, January 14, 2016

SPO600 Lab 2

For this week's lab we will be looking at two open source packages from the Free Software Foundation and installing them.
In order to find them I will browse through the GNU project's free software directory at:
http://directory.fsf.org/wiki/GNU

My first choice is GNUstep, a development environment for Objective-C that plans to add support for C++ in the future. We will look at the documentation offered through
http://www.gnustep.org/resources/documentation/User/GNUstep/userfaq_toc.html
in order to install the package.

On Fedora, most of the required packages for this installation are included in the "Development tools" group, so it is recommended to install those first.

If you are not on Fedora, the required packages can be found in the installation tutorial. Each one has a specific use, and you can opt out of most of them if you do not require the functionality, so read carefully what each one does.
http://www.gnustep.org/resources/documentation/User/GNUstep/gnustep-howto_2.html#Preliminaries

I tried installing the different packages and running the application, but I mistakenly installed it through the install command, thinking it would work as well. I should have compiled it from source instead, so my first attempt did not work.

After my first attempt I went to try a simpler program to see if I could get that to install.
I chose http://directory.fsf.org/wiki/Hello, which is a take on the Hello World program everyone writes when they just start programming. It is a demonstrative program meant to educate you. We install it by downloading the source from the provided link; this is a .tar.gz file, which means it is in compressed form.
We can decompress it by using the command
tar -xzf filelocation/filename.tar.gz
After which we build it by executing the following commands:
./configure
make

We can now run our Hello program by typing ./hello in the build directory (or install it system-wide first with make install).

This is the standard procedure for installing source packages. Sometimes the names will be slightly different, for instance ./configuration; view the file names inside the package to see what your specific package requires.

After my success with the Hello program, I downloaded the tar.gz files for GNUstep and used the tar -xvf command to extract them.
After building the gnustep-make package using ./configure and make, we have to run
. /usr/local/share/GNUstep/Makefiles/GNUstep.sh
Then we install the other packages the same way we installed the make package.
We are now ready to run GNUstep. Before we do, though, we have to make sure our environment is set up correctly by sourcing the GNUstep.sh file again (with the . command, so the environment variables persist in our current shell).

After the compilation we can run the program from the command line. In class I learned that we should not install the files with root privileges, because it could overwrite the packages already present in Fedora. Instead we can just run the programs from the directories in which we built them.