zondag 28 februari 2016

SPO600 Lab 6

For this lab we are writing a simple program to see if we can let the gcc compiler auto vectorization functionality kick in. We have seen in class that auto vectorization does not kick in at optimization levels lower than level three. The program is going to fill two 1000 integer arrays with random numbers, then sum them into a third array. Lastly it is going to add up all the numbers in the third array into a long integer.

Our code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>

int main() {
    srand(time(NULL));
    int arraySize = 1000;
    int16_t* a = (int16_t*)malloc(sizeof(int16_t*) * arraySize);
    int16_t* b = (int16_t*)malloc(sizeof(int16_t*) * arraySize);
    int16_t* c = (int16_t*)malloc(sizeof(int16_t*) * arraySize);
    for (int i = 0; i < arraySize; i++) {
        a[i] = rand();
        b[i] = rand();
    }
    for (int j = 0; j < arraySize; j++) {
        c[j] = a[j] + b[j];
    }
    long sum = 0;
    for (int k = 0; k < arraySize; k++) {
        sum += c[k];
    }
    printf("%d\n", sum);
    free(a);
    free(b);
    free(c);
    return 0;
}
A couple of notes about this code:
I realize the three loops could have been one loop, but the specific wording in the assignment suggested the operations would have to be done in a sequential order so I implemented it this way on purpose. I suspect that it may also be easier to recognized as vectorizable by the compiler this way, but I will have to test some more on that later.
Note the assignment did not ask for printing the result, but I was worried the compiler might get rid of my code if I didn't do anything with the result and it is nice to have some confirmation that the program worked.

We will compile this code on the arch64 system we used in the previous lab called aarchie in order to have a look at some arch64 vectorization assembly language.

The compiler command we run is:
gcc -x c vectorization.C -o vectorization -O3 -std=c99 -g
The -std=99 option is necessary on aarchie because it has the 4.8.3 version of the gcc compiler and c89 is still default there. -O3 is to turn the optimizations we are looking for on. The -x c is to specify we want it compiled in C, for some reason on this code gcc was assuming I wanted to compile it as c++ and giving me back error messages, this solved that problem. -o is to specify the output filename. -g is to specify debug information.

I am going to look at our code using the objdump -d command, note that because we specified -O3 we gave the compiler permission to toss around our code making the assembly substantially harder to read. Because our code is so garbled up I am going to look for some excerpts showing that auto vectorization kicked in.

The first vectorization I found is:
  40071c:       4cdf7441        ld1     {v1.8h}, [x2], #16
  400720:       4cdf7460        ld1     {v0.8h}, [x3], #16

Here we are loading part of our first and second array in at the top, and using an incremental immediate offset of 16 to jump past the section we just did in the next iteration. (loading and incrementing the offset in one instruction).
  400724:       4e608420        add     v0.8h, v1.8h, v0.8h

We then call the add on the vector registers, storing the result in vector register 0.
  400728:       11000400        add     w0, w0, #0x1
This add instruction is for the loop control variable. The compiler chose to do this before storing the result so the processor gets some time to finish this operation before calling compare.
  40072c:       4c9f7420        st1     {v0.8h}, [x1], #16
 Here we are storing our result into a memory location pointed to by x1, also using an incremental immediate offset.
  400730:       6b04001f        cmp     w0, w4
  400734:       54ffff43        b.cc    40071c <main+0x13c>
The compare and branch back to the top if we aren't finished yet. Note that the optional incremental immediate offsets are saving us a lot of instructions otherwise spent in memory management here.

The second part we expected to be vectorized was the summation of the third array, that is found in this section:
 4008b8:       4cdf7420        ld1     {v0.8h}, [x1], #16
 Here we load the array into vector register 0, note it is still in the adress pointed to by x1 where it was stored in the previous section, we are also still using the incremental immediate offset.
  4008bc:       0f10a401        sxtl    v1.4s, v0.4h
  4008c0:       0f20a423        sxtl    v3.2d, v1.2s
  4008c4:       4f10a400        sxtl2   v0.4s, v0.8h
  4008c8:       4f20a421        sxtl2   v1.2d, v1.4s
  4008cc:       4ee28462        add     v2.2d, v3.2d, v2.2d
  4008d0:       4ee28422        add     v2.2d, v1.2d, v2.2d
  4008d4:       11000400        add     w0, w0, #0x1
  4008d8:       0f20a401        sxtl    v1.2d, v0.2s
  4008dc:       4ee28422        add     v2.2d, v1.2d, v2.2d
  4008e0:       4f20a400        sxtl2   v0.2d, v0.4s
  4008e4:       6b00007f        cmp     w3, w0
  4008e8:       4ee28402        add     v2.2d, v0.2d, v2.2d
  4008ec:       54fffe68        b.hi    4008b8 <main+0x2d8>
Above we see a somewhat hard to read mixture of additions and lengthening instructions. The add and compare instructions on the regular registers are the loop control variables. I think the reason the addition and lengthening instructions are so intermixed is because the compiler is reusing the original vector register(v0) that already contains the right values as much as possible. This way there is no need to move the originally loaded values around as much.

Something interesting to note here is that I am not sure why the compiler is lengthening the values to 64 bit signed integer, when a long should require only 32 bits. The compiler is also not making use of the vector reduce operation which would be the logical operation to use in this situation. That last part may be explained by the choice for 64 bit integers, because reducing 2 values is probably less efficient than vector addition. Furthermore, the compiler did all this vectorization without us providing it any hints so far, I will try to see if I can fix the second part with some hints in the code later on. I am glad that it was reasonably easy to find the vector instructions even with the O3 optimization level.

Lastly we were asked to look for a way to vectorize our sound volume program from last lab. Vector registers don't allow for operations across multiple datatypes so we have to think of something to convert our 32 bit float.

In the arm overview I found the following instruction:
FCVTxS Vd.<T>, Vn.<T>
This allows us to use vectorization to convert our 32 bit floats into 32 bit signed integers.
We can then promote our 16 bit signed integer samples to 32 bit signed integers using the vectorized lengthening instructions:
SXTL Vd.<Td>, Vn.<Ts>
SXTL2 Vd.<Td>, Vn.<Ts> 
Because the data are now the same type and length we can now use the following instruction to multiply the two:
FMUL Vd.<T>, Vn.<T>, Vm.<Ts>[index]
 I believe these instructions should lead to a valid vectorized sound multiplying program. Whether it is the most optimal solution I am not sure and I will have to look at this some more later.

That's it for now! I will revisit some of this later due to multiple things I am not 100% sure about at the moment.

Geen opmerkingen:

Een reactie posten