Ilkyu Song - SPO600: March 2018

Sunday, 18 March 2018

SPO600 Project - Stage 1

REDIS (Remote Dictionary Server)

I chose the Redis open source package for the SPO600 Project Stage1.
Redis (Remote Dictionary Server) is a kind of NoSQL for storing and managing non-relational data of 'key-value' structure. It was first developed by Salvatore Sanfilippo in 2009. Redis Labs has been supporting it since 2015. It is a memory-based DBMS that loads and processes all data into memory. Moreover, Redis follows the BSD license. According to DB-Engines.com's monthly rankings, Redis is the most popular key-value store.

Redis supports various data types.

String: Supports up to 512 Mbytes in length as a regular string. It can store not only Text strings, but also binary files such as Integer and numbers and JPEGs.

Sets: Redis is a random ordered set of strings. Set can operate intersection, difference of set, etc., but it is very fast.

Sorted Set: A data type with a field named "Score" added to the Set. This is similar to sets.

Sorted sets are very fast to add, delete, and update elements.

Hashes: A data type that can store a pair of field/string values within a value.

List: A simple list of strings. They are sorted by the input order.

First of all, I had to install Redis. Learn how to install and run. The installation files and how to run them are on the Redis site. Redis is also open source and can be downloaded from GitHub.

Redis site: https://redis.io/

Github: https://github.com/antirez/redis

$ wget http://download.redis.io/releases/redis-4.0.8.tar.gz
$ tar xzf redis-4.0.8.tar.gz
$ cd redis-4.0.8
$ make

$ src/redis-server
$ src/redis-cli

Let's test a simple key-value.

In the above image, you can see that Redis is installed normally. If so, let's try measuring the performance of Redis. Redis performance measurements are provided by Redis itself. The performance of all software depends on the specifications of the server. So let's first look at the specification of the server.

First, let's check the processing speed according to the amount of data. The unit is bytes. Up to 10000 bytes, the data processing speed does not show much difference, but it is sharply decreasing.

Second, the time taken to process 1000000 commands was measured for each function. We can see that it takes similar processing time except Get function

I tested separately the Hset function which stored by the Hash field. Because the Hset function will be tested in detail on Stage2.

Monday, 12 March 2018

Lab6 - Inline Assembler

Part1.

The inline assembler is an assembler that allows the use of assembly language at the language level and translates the assembly language between the codes into the machine language when the compiler translates the source code. In other words, it refers to using assembler commands directly in high-level languages. In this lab, we will be testing the inline assembler. First, run the example source on the aarchie server.

// vol_simd.c :: volume scaling in C using AArch64 SIMD
// Chris Tyler 2017.11.29-2018.02.20

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

int main() {

    int16_t*        in;        // input array
    int16_t*        limit;        // end of input array
    int16_t*        out;        // output array

    // these variables will be used in our assembler code, so we're going
    // to hand-allocate which register they are placed in
    // Q: what is an alternate approach?
    register int16_t*    in_cursor     asm("r20");    // input cursor
    register int16_t*    out_cursor    asm("r21");    // output cursor
    register int16_t    vol_int        asm("r22");    // volume as int16_t

    int            x;        // array interator
    int            ttl;        // array total

    in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
    out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

    srand(-1);
    printf("Generating sample data.\n");
    for (x = 0; x < SAMPLES; x++) {
        in[x] = (rand()%65536)-32768;
    }

// --------------------------------------------------------------------

    in_cursor = in;
    out_cursor = out;
    limit = in + SAMPLES ;

    // set vol_int to fixed-point representation of 0.75
    // Q: should we use 32767 or 32768 in next line? why?
    vol_int = (int16_t) (0.75 * 32767.0);

    printf("Scaling samples.\n");

    // Q: what does it mean to "duplicate" values in the next line?
    __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

    while ( in_cursor < limit ) {
        __asm__ (
            "ldr q0, [%[in]],#16        \n\t"
            // load eight samples into q0 (v0.8h)
            // from in_cursor, and post-increment
            // in_cursor by 16 bytes

            "sqdmulh v0.8h, v0.8h, v1.8h    \n\t"
            // multiply each lane in v0 by v1*2
            // saturate results
            // store upper 16 bits of results into v0
            
            "str q0, [%[out]],#16        \n\t"
            // store eight samples to out_cursor
            // post-increment out_cursor by 16 bytes

            // Q: what happens if we remove the following
            // two lines? Why?
            : [in]"+r"(in_cursor)
            : "0"(in_cursor),[out]"r"(out_cursor)
            );
    }

// --------------------------------------------------------------------

    printf("Summing samples.\n");
    for (x = 0; x < SAMPLES; x++) {
        ttl=(ttl+out[x])%1000;
    }

    // Q: are the results usable? are they correct?
    printf("Result: %d\n", ttl);

    return 0;

}

The image below is the result of running the above code.

Now, let's answer the questions in the code.

Q: what is an alternate approach?
An alternative approach is not to assign a value to an object, but to allow the object to recognize the value. Therefore, declare a register variable without assigning a value.

Q: should we use 32767 or 32768 in next line? why?
The range of integer values is -32,768 to 32,767. Therefore, we must use the value 32,767.

Q: what does it mean to "duplicate" values in the next line?
It means to store the value of vol_int variable in v1.8h.

Q: what happens if we remove the following two lines? Why?
If these two lines are erased, a segment fault will occur.

Q: are the results usable? are they correct?
Yes, the results are correct and usable.

The assembly is 1:1 matched with the machine language. The C language combines several CPU instructions into a single statement. Eventually, it can be seen to the extent that you can easily express the part that is a little handy when you implement the assembly.
Thus, C language is called low-level language and more machine-friendly. If you think about common sense, if you are only good at optimizing it, it will be almost similar to what you've coded into the assembly. It's a good expression for creating programs like speed-critical operating systems.

Sunday, 4 March 2018

Lab5 - Algorithm Selection

In this lab, we will look at how algorithms affect program execution and compile time. To understand this, I will use digital sound as an example. First, calculate the execution time using the Clock object in the sample code on aarchie server.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
    return (int16_t) (volume_factor * (float) sample);
}

int main() {

    // Allocate memory for large in and out arrays
    int16_t*    in;
    int16_t*    out;
    in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
    out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

    int        x;
    int        ttl;

    clock_t start, end;
    float ftime;

    // Seed the pseudo-random number generator
    srand(-1);

    // Fill the array with random data
    for (x = 0; x < SAMPLES; x++) {
        in[x] = (rand()%65536)-32768;
    }

    start = clock();

    // ######################################
    // This is the interesting part!
    // Scale the volume of all of the samples
    for (x = 0; x < SAMPLES; x++) {
        out[x] = scale_sample(in[x], 0.75);
    }
    // ######################################

    end = clock();

    // Sum up the data
    for (x = 0; x < SAMPLES; x++) {
        ttl = (ttl+out[x])%1000;
    }

    ftime = (float)(end - start) / CLOCKS_PER_SEC;

    // Print the sum
    printf("Result: %d\n", ttl);
    printf("Runing Time: %f sec.\n", ftime);

    return 0;

}

The result obtained by executing the above source is shown in the figure below.

Now let's expand the source and test it. The second test is as follows.
1. Pre-calculate a lookup table (array) of all possible sample values multiplied by the volume factor, and look up each sample to get the scaled values.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
    return (int16_t) (volume_factor * (float) sample);
}

int main() {

    // Allocate memory for large in and out arrays
    int16_t*    in;
    int16_t*    out;
    in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
    out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

    int16_t* lookUp = calloc(65536, sizeof(int16_t));

    int        x;
    int        ttl;

    clock_t start, end;
    float ftime;

    // Seed the pseudo-random number generator
    srand(-1);

    // Fill the array with random data
    for (x = 0; x < SAMPLES; x++) {
        in[x] = (rand()%65536)-32768;
    }

    for (x = 0; x < 65536; x++) {
        lookUp[x] = (x - 32768) * 0.75;
    }

    start = clock();

    // ######################################
    // This is the interesting part!
    // Scale the volume of all of the samples
    for (x = 0; x < SAMPLES; x++) {
        out[x] = (lookUp[in[x] + 32768]);
    }
    // ######################################

    // Sum up the data
    for (x = 0; x < SAMPLES; x++) {
        ttl = (ttl+out[x])%1000;
    }

    end = clock();

    ftime = (float)(end - start) / CLOCKS_PER_SEC;

    // Print the sum
    printf("Result: %d\n", ttl);
    printf("Runing Time: %f\n sec.", ftime);

    return 0;

}

Finally, the test contents are as follows.
2. Convert the volume factor 0.75 to a fix-point integer by multiplying by a binary number representing a fixed-point value "1". For example, you could use 0b100000000 (= 256 in decimal) to represent 1.00. Shift the result to the right the required number of bits after the multiplication (>>8 if you're using 256 as the multiplier).

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
    return (int16_t)(volume_factor * (float)sample);
}

int main() {

    // Allocate memory for large in and out arrays
    int16_t*    in;
    int16_t*    out;
    in = (int16_t*)calloc(SAMPLES, sizeof(int16_t));
    out = (int16_t*)calloc(SAMPLES, sizeof(int16_t));

    int16_t* lookUp = calloc(65536, sizeof(int16_t));

    int        x;
    int        ttl;

    clock_t start, end;
    float ftime;

    // Seed the pseudo-random number generator
    srand(-1);

    // Fill the array with random data
    for (x = 0; x < SAMPLES; x++) {
        in[x] = (rand() % 65536) - 32768;
    }

    start = clock();

    // ######################################
    // This is the interesting part!
    // Scale the volume of all of the samples
    for (x = 0; x < SAMPLES; x++) {
        out[x] = (int16_t)(in[x] * 0.75 * 256) >> 8;
    }
    // ######################################

    end = clock();

    // Sum up the data
    for (x = 0; x < SAMPLES; x++) {
        ttl = (ttl + out[x]) % 1000;
    }

    ftime = (float)(end - start) / CLOCKS_PER_SEC;

    // Print the sum
    printf("Result: %d\n", ttl);
    printf("Runing Time: %f\n sec.", ftime);

    return 0;

}

Let's compare the three tests at the same time.

Comparing the three tests, the second test was the slowest. The third test is not much different from the first. The third is the most optimized source.

Ilkyu Song - SPO600