Sunday, February 28, 2016

SPO600 Lab 7

In this lab we are going to look at a software package that uses inline assembler: why it uses it, whether that use is correct, and which platforms it is present for.

I will be looking at libmad (MAD), an MPEG audio decoding package. If you wish to follow along, the source code package is available on SourceForge at https://sourceforge.net/projects/mad/files/libmad/. As demonstrated in lab 1, we can extract the tarball using the tar -xvf [filename] command. We don't have to build the program to look at the source code. First off, I will use egrep -R "asm | __(asm)__*\(" to recursively search the source tree for the files in the package that contain assembler.
This gives us 3 results, 2 of which are actual inline uses of assembler:
intl/libgnuintl.h.in and audio_jaguar.c

First I will be looking at intl/libgnuintl.h.in to see if I can figure out what the instruction is there for. The code in question is:
#ifdef _INTL_REDIRECT_ASM
# define _INTL_ASM(cname) __asm__ (_INTL_ASMNAME (__USER_LABEL_PREFIX__, #cname))
According to the comments, this uses an asm label to give a function an alternative symbol name with the libintl_ prefix. The reason this is here is that, due to duplicate names across libraries, glibc 2.0 and newer and Solaris 2.4 define functions with the same names and would otherwise override the versions used by libintl. I can only assume that this is a valid reason to rename them.

On to audio_jaguar.c, where it concerns more than just the renaming of a function. Here we find the following function:
/*
 * NAME:        float32()
 * DESCRIPTION: store 32-bit IEEE Standard 754 floating point representation
 */
static
void float32(void **ptr, mad_fixed_t sample)
{
  unsigned long **mem = (void *) ptr, ieee = 0;

  if (sample) {
    unsigned int zc;
    unsigned long abs, s, e, f;

    enum {
      BIAS = 0x7f
    };

    /* |seeeeeee|efffffff|ffffffff|ffffffff| */

    s = sample & 0x80000000UL;
    abs = s ? -sample : sample;

    /* PPC count leading zeros */
    asm ("cntlzw %0,%1" : "=r" (zc) : "r" (abs));

    /* calculate exponent */

    e = (((32 - MAD_F_FRACBITS - 1) - zc + BIAS) << 23) & 0x7f800000UL;

    /* normalize 1.f - round? */

    f = ((abs << zc) >> 8) & 0x007fffffUL;

    ieee = s | e | f;
  }
Here the assembly instruction is used to count the leading zeros in a register, as part of converting a fixed-point value (mad_fixed_t) to the IEEE Standard 754 floating-point representation. Note that the code offers no alternatives and is only valid on PowerPC. I think this is because most modern processors support this IEEE standard natively, so they do not need a conversion function. In my opinion it is a little bit sloppy to let the function break if run on any other processor; interestingly enough, this code was written to increase the portability of the library: native support for the standard was absent on this specific platform, so they wrote an alternative implementation.

That was it for this library; there was not that much assembly to be found in here. The assembly that was there was written to increase portability to systems that have naming issues or lack native IEEE 754 support. I would have liked to have seen an alternative in C that makes sure the latter function does not break on any platform other than PowerPC, but it is probably assumed that any modern system supports the IEEE 754 standard.


SPO600 Lab 6

For this lab we are writing a simple program to see if we can get the gcc compiler's auto-vectorization to kick in. We have seen in class that auto-vectorization does not kick in at optimization levels lower than -O3. The program is going to fill two 1000-element integer arrays with random numbers, then sum them element-wise into a third array. Lastly it is going to add up all the numbers in the third array into a long integer.

Our code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>

int main() {
    srand(time(NULL));
    int arraySize = 1000;
    int16_t* a = (int16_t*)malloc(sizeof(int16_t) * arraySize);
    int16_t* b = (int16_t*)malloc(sizeof(int16_t) * arraySize);
    int16_t* c = (int16_t*)malloc(sizeof(int16_t) * arraySize);
    for (int i = 0; i < arraySize; i++) {
        a[i] = rand();
        b[i] = rand();
    }
    for (int j = 0; j < arraySize; j++) {
        c[j] = a[j] + b[j];
    }
    long sum = 0;
    for (int k = 0; k < arraySize; k++) {
        sum += c[k];
    }
    printf("%ld\n", sum);
    free(a);
    free(b);
    free(c);
    return 0;
}
A couple of notes about this code:
I realize the three loops could have been one loop, but the specific wording in the assignment suggested the operations had to be done in sequential order, so I implemented it this way on purpose. I suspect it may also be easier for the compiler to recognize as vectorizable this way, but I will have to test that some more later.
Note the assignment did not ask for printing the result, but I was worried the compiler might get rid of my code if I didn't do anything with the result, and it is nice to have some confirmation that the program worked.

We will compile this code on the aarch64 system we used in the previous lab, called aarchie, in order to have a look at some AArch64 vector assembly language.

The compiler command we run is:
gcc -x c vectorization.C -o vectorization -O3 -std=c99 -g
The -std=c99 option is necessary on aarchie because it has version 4.8.3 of the gcc compiler, where c89 is still the default. -O3 turns on the optimizations we are looking for. The -x c specifies we want it compiled as C; because of the .C extension gcc was assuming I wanted to compile it as C++ and giving me back error messages, and this solved that problem. -o specifies the output filename. -g adds debug information.

I am going to look at our code using the objdump -d command. Note that because we specified -O3 we gave the compiler permission to rearrange our code, making the assembly substantially harder to read. Because our code is so garbled up I am going to look for some excerpts showing that auto-vectorization kicked in.

The first vectorization I found is:
  40071c:       4cdf7441        ld1     {v1.8h}, [x2], #16
  400720:       4cdf7460        ld1     {v0.8h}, [x3], #16

Here we are loading part of our first and second arrays at the top, using a post-increment immediate offset of 16 so the next iteration jumps past the section we just processed (loading and incrementing the address in one instruction).
  400724:       4e608420        add     v0.8h, v1.8h, v0.8h

We then call the add on the vector registers, storing the result in vector register 0.
  400728:       11000400        add     w0, w0, #0x1
This add instruction is for the loop control variable. The compiler chose to do this before storing the result so the processor gets some time to finish this operation before calling compare.
  40072c:       4c9f7420        st1     {v0.8h}, [x1], #16
Here we are storing our result into the memory location pointed to by x1, again using a post-increment immediate offset.
  400730:       6b04001f        cmp     w0, w4
  400734:       54ffff43        b.cc    40071c <main+0x13c>
The compare and branch back to the top if we aren't finished yet. Note that the optional post-increment immediate offsets are saving us a lot of instructions otherwise spent on address arithmetic here.

The second part we expected to be vectorized was the summation of the third array, that is found in this section:
 4008b8:       4cdf7420        ld1     {v0.8h}, [x1], #16
Here we load the array into vector register 0. Note it is still at the address pointed to by x1, where it was stored in the previous section, and we are still using the post-increment immediate offset.
  4008bc:       0f10a401        sxtl    v1.4s, v0.4h
  4008c0:       0f20a423        sxtl    v3.2d, v1.2s
  4008c4:       4f10a400        sxtl2   v0.4s, v0.8h
  4008c8:       4f20a421        sxtl2   v1.2d, v1.4s
  4008cc:       4ee28462        add     v2.2d, v3.2d, v2.2d
  4008d0:       4ee28422        add     v2.2d, v1.2d, v2.2d
  4008d4:       11000400        add     w0, w0, #0x1
  4008d8:       0f20a401        sxtl    v1.2d, v0.2s
  4008dc:       4ee28422        add     v2.2d, v1.2d, v2.2d
  4008e0:       4f20a400        sxtl2   v0.2d, v0.4s
  4008e4:       6b00007f        cmp     w3, w0
  4008e8:       4ee28402        add     v2.2d, v0.2d, v2.2d
  4008ec:       54fffe68        b.hi    4008b8 <main+0x2d8>
Above we see a somewhat hard-to-read mixture of addition and lengthening (sign-extension) instructions. The add and compare instructions on the regular registers handle the loop control. I think the reason the additions and lengthening instructions are so intermixed is that the compiler reuses the originally loaded vector register (v0), which already contains the right values, as much as possible. This way there is no need to move the originally loaded values around as much.

Something interesting to note here: at first I was not sure why the compiler was lengthening the values to 64-bit signed integers, but on AArch64 Linux a long is 64 bits (the LP64 model), so the widening matches our sum variable. The compiler is also not using a vector reduce operation inside the loop, which seemed like the logical choice in this situation; instead it accumulates partial sums in v2, presumably leaving a single cross-lane reduction for after the loop, which is cheaper than reducing on every iteration. Furthermore, the compiler did all this vectorization without us providing it any hints so far; I will try to see if I can help the second part along with some hints in the code later on. I am glad that it was reasonably easy to find the vector instructions even at the -O3 optimization level.

Lastly we were asked to look for a way to vectorize our sound volume program from last lab. Vector operations work on lanes of a single data type, so we have to think about how to combine our 16-bit integer samples with the 32-bit float scale factor.

In the ARM overview I found the lengthening instructions that promote our 16-bit signed integer samples to 32-bit signed integers:
SXTL Vd.<Td>, Vn.<Ts>
SXTL2 Vd.<Td>, Vn.<Ts>
We can then convert the widened integers to 32-bit floats using:
SCVTF Vd.<T>, Vn.<T>
Because the data are now the same type and length, we can multiply every lane by the scale factor with:
FMUL Vd.<T>, Vn.<T>, Vm.<Ts>[index]
Lastly, a conversion from the FCVTxS family (for example FCVTZS Vd.<T>, Vn.<T>, which truncates toward zero) brings the results back to 32-bit signed integers, after which they can be narrowed back down to 16 bits.
I believe these instructions should lead to a valid vectorized sound-scaling program. Whether it is the most optimal solution I am not sure, and I will have to look at this some more later.

That's it for now! I will revisit some of this later due to multiple things I am not 100% sure about at the moment.

Monday, February 22, 2016

SPO600 Lab 5

For this lab we will be looking at 2 different approaches to scaling 16-bit signed integer sound samples by a float to simulate different volumes. We will also be testing these approaches on 2 different machine architectures: one aarch64 and one x86_64 machine.

The first approach will be to naively scale the value and return the result. When scaling n samples this results in n multiplication instructions.

The second approach is to create a lookup table where we store the scaled result of every possible sample value at its index. We populate the lookup table whenever the user changes the scaling factor, but on consecutive calls with no change we can just look the value up in the table. With a 16-bit signed integer this results in 65536 multiplication instructions per volume change. The cache will probably be too small to hold the entire 128 KB table, so this approach will be more memory intensive.

To achieve a clear difference we will be attempting to scale 500 million sound samples, which is roughly equal to 1 hour and 35 minutes of samples in 44.1 kHz stereo sound.

To estimate the time used by the scaling functions we will be using a control program that does the same memory allocation and basic operations as the program would do without the scaling functions. We will be timing these three programs to see which one does best.

The control program looks like this:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>

void fillSampleArray(int16_t* arr, int numOfElems);

int main()
{
    //creating an array of samples
    srand((int)time(NULL));
    int numberOfElements = 500000000;
    int16_t* sampleArray = (int16_t*)malloc(sizeof(int16_t) * numberOfElements);
    fillSampleArray(sampleArray, numberOfElements);

    int result = INT16_MIN;

    for (int i = 0; i < numberOfElements; i++) {
        result = result ^  sampleArray[i];
    }

    printf("%d", result);
    free(sampleArray);
    return 0;
}


void fillSampleArray(int16_t* arr, int numOfElems)
{
    for (int i = 0; i < numOfElems; i++)
    {
        arr[i] = ((rand() * 2) % (UINT16_MAX + 1)) + (INT16_MIN);
    }
}
Some things to note about this code:
The reason we are using the ^ operator and printing a result is to make sure the compiler does not strip our code thinking that it is useless. We force the code to contribute to some form of output so the compiler can't get rid of it.

The reason we are multiplying rand() by 2 is that rand() is only guaranteed a maximum value (RAND_MAX) of at least 32767, while we need to represent around 65000 numbers, so this is our way of artificially doing so. Note that this creates only even numbers and still only around 32000 distinct values. For our scaling function this does not matter, as we just care about having a lot of samples.

For the 2 implementations we add this function signature above main:
int16_t volumeScale(int16_t sample, float scale);
We have a small change to the main program to use the function, the result is now calculated as follows:
result = result ^ volumeScale(sampleArray[i], 0.750);
Our naive solution is fairly simple and looks as you would expect:
int16_t volumeScale(int16_t sample, float scale)
{
    return (int16_t)(sample * scale);
}
The lookup function is a bit more complicated due to the need to save state across calls and to allocate the array that stores the precomputed results.
int16_t lookUpVolumeScale(int16_t sample, float scale)
{
    //previous volume
    static float pVolume = -1;

    //lookup table
    static int16_t* luTable=NULL;

    if(luTable==NULL){
        luTable = (int16_t*)malloc(sizeof(int16_t) * (UINT16_MAX + 1));
    }

    if(pVolume != scale)
    {
        for(int i=0; i<UINT16_MAX+1; i++)
        {
            //INT16_MIN is the offset
            luTable[i] = scale*(i+INT16_MIN);    
        }
        pVolume = scale;
    }
    return luTable[sample + INT16_MAX + 1];
}
Funny thing here: I am a C# / Java programmer who has never programmed anything in C besides a flashing LED, so I did not actually know how to save state inside a function across calls. It took me 2 hours of googling to figure out that the static keyword does all this for you. Sometimes when you are searching for something considered very simple by C programmers, it can actually be very hard to find a result that explains it.

Lastly we run a comparison of the three programs on both platforms. We will be using the time command and the real duration in seconds for this comparison:

Platform             Control    Naive      Lookup
x86_64 (Xerxes)      10.728     12.354     19.238
aarch64 (Aarchie)    181.366    286.152    223.343

Some things to note here:
  • We have not changed the volume; this means that we are assuming our user is going to listen to 1.5 hours of music without ever changing the volume. 
  • The aarch64 system is compiling with a slightly older version of gcc. 
  • The time function is probably not the best approach to accurately time these programs. 

With these things in mind, I think the size of the gaps between the solutions still lets us say a few things about the results. Even without changing the volume, the naive version outperforms the lookup table on x86_64; this probably has to do with memory access speed being the bottleneck for the table lookups. On the aarch64 system, however, the lookup table performs much better than the naive scaling. There is a full minute of difference; even if you account for us never changing the volume, this is a significant gap. This suggests that processing power is the bottleneck on that system.
In both cases we could easily scale the amount of sound we had in a fraction of the time required to listen to it, so there would be no problem with either approach. However, if you are planning to scale a huge number of sound samples, a test like this is definitely worth the effort.

Next lab we will be looking at parallelising some of these operations to see if we can achieve an even faster result!