Our code looks as follows:
#include <stdlib.h>A couple of notes about this code:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
int main() {
srand(time(NULL));
int arraySize = 1000;
int16_t* a = (int16_t*)malloc(sizeof(int16_t*) * arraySize);
int16_t* b = (int16_t*)malloc(sizeof(int16_t*) * arraySize);
int16_t* c = (int16_t*)malloc(sizeof(int16_t*) * arraySize);
for (int i = 0; i < arraySize; i++) {
a[i] = rand();
b[i] = rand();
}
for (int j = 0; j < arraySize; j++) {
c[j] = a[j] + b[j];
}
long sum = 0;
for (int k = 0; k < arraySize; k++) {
sum += c[k];
}
printf("%d\n", sum);
free(a);
free(b);
free(c);
return 0;
}
I realize the three loops could have been one loop, but the specific wording in the assignment suggested the operations would have to be done in a sequential order so I implemented it this way on purpose. I suspect that it may also be easier to recognized as vectorizable by the compiler this way, but I will have to test some more on that later.
Note the assignment did not ask for printing the result, but I was worried the compiler might get rid of my code if I didn't do anything with the result and it is nice to have some confirmation that the program worked.
We will compile this code on the arch64 system we used in the previous lab called aarchie in order to have a look at some arch64 vectorization assembly language.
The compiler command we run is:
gcc -x c vectorization.C -o vectorization -O3 -std=c99 -gThe -std=99 option is necessary on aarchie because it has the 4.8.3 version of the gcc compiler and c89 is still default there. -O3 is to turn the optimizations we are looking for on. The -x c is to specify we want it compiled in C, for some reason on this code gcc was assuming I wanted to compile it as c++ and giving me back error messages, this solved that problem. -o is to specify the output filename. -g is to specify debug information.
I am going to look at our code using the objdump -d command, note that because we specified -O3 we gave the compiler permission to toss around our code making the assembly substantially harder to read. Because our code is so garbled up I am going to look for some excerpts showing that auto vectorization kicked in.
The first vectorization I found is:
40071c: 4cdf7441 ld1 {v1.8h}, [x2], #16
400720: 4cdf7460 ld1 {v0.8h}, [x3], #16
Here we are loading part of our first and second array in at the top, and using an incremental immediate offset of 16 to jump past the section we just did in the next iteration. (loading and incrementing the offset in one instruction).
400724: 4e608420 add v0.8h, v1.8h, v0.8h
We then call the add on the vector registers, storing the result in vector register 0.
400728: 11000400 add w0, w0, #0x1This add instruction is for the loop control variable. The compiler chose to do this before storing the result so the processor gets some time to finish this operation before calling compare.
40072c: 4c9f7420 st1 {v0.8h}, [x1], #16Here we are storing our result into a memory location pointed to by x1, also using an incremental immediate offset.
400730: 6b04001f cmp w0, w4The compare and branch back to the top if we aren't finished yet. Note that the optional incremental immediate offsets are saving us a lot of instructions otherwise spent in memory management here.
400734: 54ffff43 b.cc 40071c <main+0x13c>
The second part we expected to be vectorized was the summation of the third array, that is found in this section:
4008b8: 4cdf7420 ld1 {v0.8h}, [x1], #16Here we load the array into vector register 0, note it is still in the adress pointed to by x1 where it was stored in the previous section, we are also still using the incremental immediate offset.
4008bc: 0f10a401 sxtl v1.4s, v0.4hAbove we see a somewhat hard to read mixture of additions and lengthening instructions. The add and compare instructions on the regular registers are the loop control variables. I think the reason the addition and lengthening instructions are so intermixed is because the compiler is reusing the original vector register(v0) that already contains the right values as much as possible. This way there is no need to move the originally loaded values around as much.
4008c0: 0f20a423 sxtl v3.2d, v1.2s
4008c4: 4f10a400 sxtl2 v0.4s, v0.8h
4008c8: 4f20a421 sxtl2 v1.2d, v1.4s
4008cc: 4ee28462 add v2.2d, v3.2d, v2.2d
4008d0: 4ee28422 add v2.2d, v1.2d, v2.2d
4008d4: 11000400 add w0, w0, #0x1
4008d8: 0f20a401 sxtl v1.2d, v0.2s
4008dc: 4ee28422 add v2.2d, v1.2d, v2.2d
4008e0: 4f20a400 sxtl2 v0.2d, v0.4s
4008e4: 6b00007f cmp w3, w0
4008e8: 4ee28402 add v2.2d, v0.2d, v2.2d
4008ec: 54fffe68 b.hi 4008b8 <main+0x2d8>
Something interesting to note here is that I am not sure why the compiler is lengthening the values to 64 bit signed integer, when a long should require only 32 bits. The compiler is also not making use of the vector reduce operation which would be the logical operation to use in this situation. That last part may be explained by the choice for 64 bit integers, because reducing 2 values is probably less efficient than vector addition. Furthermore, the compiler did all this vectorization without us providing it any hints so far, I will try to see if I can fix the second part with some hints in the code later on. I am glad that it was reasonably easy to find the vector instructions even with the O3 optimization level.
Lastly we were asked to look for a way to vectorize our sound volume program from last lab. Vector registers don't allow for operations across multiple datatypes so we have to think of something to convert our 32 bit float.
In the arm overview I found the following instruction:
FCVTxS Vd.<T>, Vn.<T>This allows us to use vectorization to convert our 32 bit floats into 32 bit signed integers.
We can then promote our 16 bit signed integer samples to 32 bit signed integers using the vectorized lengthening instructions:
Because the data are now the same type and length we can now use the following instruction to multiply the two:SXTL Vd.<Td>, Vn.<Ts>SXTL2 Vd.<Td>, Vn.<Ts>
FMUL Vd.<T>, Vn.<T>, Vm.<Ts>[index]I believe these instructions should lead to a valid vectorized sound multiplying program. Whether it is the most optimal solution I am not sure and I will have to look at this some more later.
That's it for now! I will revisit some of this later due to multiple things I am not 100% sure about at the moment.
Geen opmerkingen:
Een reactie posten