dinsdag 29 maart 2016

Project Selection part 1

We've started our projects and I have selected slim, a compression algorithm that mainly deals with very large integers of 1,2 or 4-byte word size. One of the things I noticed immediately was that it had not been updated for a while, it was written in 2008 and some additions were made in 2013. Right now it is a one man project as all the contributions were made by a single person.

I knew it was not going to be easy to find documentation because there was only one developer and the last he worked on it was 3 years ago. Undeterred I installed the package on both aarchie and my own Fedora installation. The package can be found here: https://sourceforge.net/projects/slimdata/
The installation was very straightforward and just required a tar -xvf, a ./configure with some parameters to be found in the included README and a make install. I could now slim and unslim files, I tried it on a few text files and it worked. However the program clearly stated it was made for large integer groups with repetitive patterns, so I set out to create a program that would make a file for me. After much experimentation I finished with this:
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
int main() {
    srand(time(NULL));
    //ints in a GB
    int64_t size = 1073741824/4;
    int repetition = 20000;
    int32_t* data = (int32_t*) malloc(sizeof(int32_t)*size);
    int i = 0;
    while (i < size) {
        int32_t number = rand();
        //repeat j times
        for (int j = 0; j < repetition; j++) {
            if (i < size) {
                data[i] = number;
                i++;
            }
        }
    }
    FILE *fp = fopen("data", "w");
    fwrite(data, 4, size, fp);
    //for(int x = 0; x<size; x++){
    //fprintf(fp, "%d", data[x]);
    //}
    fclose(fp);
    printf("success");
    return 0;
}
I have tried using fwrite, to just write the binary values of the integers to a file. I have tried fprintf, to write the text values of the integers to a file. I have tried random repetition amounts and swapping around some integers to create a more or less diverse file. I have tried smaller and larger files. The text files would not even slim, it would literally break the original file into an unrecoverable state. The integer files would slim but the maximum compression file gain was 1,5% of the file or about 20MB on a 1GB file.
I think I can safely assume that this is not the intended gain on a file of this or any size as gzip would win about 50% of the file size in about the same amount of time.

It saddens me to say that I think that I will have to admit defeat here and pick a different project. I will soon update on that.

Geen opmerkingen:

Een reactie posten