SPO 600 - Lab 6

In this lab, I explore inline assembler and how it is used in open source software. For more information, click here.

Part A

For comparison, I tested the performance of vol_simd (the version with inline assembly and SIMD) against vol3 (the version with no inline assembly, which I built in lab 5). Both were run on aarch64, since the inline assembly version is written in aarch64 assembly. vol_simd is somewhat faster than its pure-C counterpart: the reported time drops from 0.001029 s to 0.000670 s.

Inline assembly and SIMD:
[GeoffWu@aarchie spo600_20181_inline_assembler_lab]$ time ./vol_simd
Generating sample data.
Scaling samples.
Summing samples.
Result: -454
Time spent: 0.000670

real    0m0.027s
user    0m0.027s
sys    0m0.000s

Pure C:
[GeoffWu@aarchie spo600_20181_vol_skel]$ time ./vol1
Result: -142
Time spent: 0.001029

real    0m0.028s
user    0m0.028s
sys    0m0.000s



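I then increased the number of samples to 500000000, and vol_simd was still faster (0.826959 s versus 1.068387 s). The total run time is nearly identical, though, because most of it falls outside the timed section.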

Inline assembly and SIMD:
[GeoffWu@aarchie spo600_20181_inline_assembler_lab]$ make && time ./vol_simd
gcc -g -O3 vol_simd.c -o vol_simd
Generating sample data.
Scaling samples.
Summing samples.
Result: -751
Time spent: 0.826959

real    0m25.909s
user    0m24.816s
sys    0m1.079s

Pure C:
[GeoffWu@aarchie spo600_20181_vol_skel]$ time ./vol1
Result: -606
Time spent: 1.068387

real    0m25.942s
user    0m24.828s
sys    0m1.098s

//Q: what is an alternate approach?
        register int16_t*       in_cursor       asm("r20");     // input cursor
        register int16_t*       out_cursor      asm("r21");     // output cursor
        register int16_t        vol_int         asm("r22");     // volume as int16_t
A: Let the compiler choose which registers to use.
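To illustrate that answer, here is a small sketch of inline assembly with no registers pinned. The function add_i32 is a made-up example, not part of the lab code. The variables are ordinary C variables, and the "r" constraints ask the compiler to pick any general-purpose register. The assembly is only used when building for aarch64; other platforms fall back to plain C.

```c
#include <stdint.h>

/* Hypothetical example: add two integers, letting the compiler allocate registers. */
static inline int32_t add_i32(int32_t a, int32_t b)
{
#if defined(__aarch64__)
    int32_t result;
    /* %w[...] selects the 32-bit (w) view of whichever register the compiler chose. */
    __asm__ ("add %w[res], %w[x], %w[y]"
             : [res] "=r"(result)        /* output: any general-purpose register */
             : [x] "r"(a), [y] "r"(b));  /* inputs: any general-purpose registers */
    return result;
#else
    /* Portable fallback for platforms other than aarch64. */
    return a + b;
#endif
}
```

Because nothing is tied to a specific register, the compiler is free to fit the assembly into its own register allocation around the call.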

// set vol_int to fixed-point representation of 0.75
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t) (0.75 * 32767.0);
A: 32767, because the range of int16_t is -32,768 to 32,767, so 32768 itself cannot be represented.
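The fixed-point idea can be sketched in plain C. The helper names volume_to_fixed and scale_sample are my own and do not come from the lab code. The point is that a volume factor of 1.0 still fits into int16_t when the scale is 32767, whereas 32768 would not.

```c
#include <stdint.h>

/* Convert a volume factor in the range 0.0 to 1.0 to a 16-bit fixed-point value. */
static int16_t volume_to_fixed(double volume)
{
    /* With 32767 as the scale, a factor of 1.0 maps to INT16_MAX. */
    return (int16_t)(volume * 32767.0);
}

/* Scale one sample by a fixed-point volume factor. */
static int16_t scale_sample(int16_t sample, int16_t vol_int)
{
    /* Multiply in 32 bits, then divide by 2^15 to return to the 16-bit range. */
    return (int16_t)(((int32_t)sample * vol_int) / 32768);
}
```

For example, volume_to_fixed(0.75) gives 24575, and scaling a sample of 1000 by that factor gives 749.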

// Q: what happens if we remove the following
                        // two lines? Why?
                        : [in]"+r"(in_cursor)
                        : "0"(in_cursor),[out]"r"(out_cursor)
                        );
A: The code does not compile, because the compiler no longer knows where to read the input and where to write the output.
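To show what the operand lists do, here is a reduced sketch rather than the lab code. The function load_and_advance is hypothetical. It uses a "+r" operand, like the [in] operand quoted above, to tell the compiler that the pointer is both read and modified by the assembly. Without that line the template would refer to an operand that does not exist.

```c
#include <stdint.h>

/* Hypothetical example: load one sample and advance the cursor past it. */
static inline int16_t load_and_advance(const int16_t **cursor)
{
#if defined(__aarch64__)
    const int16_t *p = *cursor;
    int32_t value;
    /* Post-indexed load: read a signed halfword from [p], then add 2 to p. */
    __asm__ ("ldrsh %w[val], [%[in]], #2"
             : [val] "=r"(value), [in] "+r"(p)  /* p is both input and output */
             :
             : "memory");                       /* the asm reads from memory */
    *cursor = p;
    return (int16_t)value;
#else
    /* Portable fallback with the same effect. */
    return *(*cursor)++;
#endif
}
```

The "+r" constraint matters because the instruction changes the pointer as a side effect. If it were declared as a plain input, the compiler would assume the register still held the old address.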

// Q: are the results usable? are they correct?
A: On the aarch64 platform, the results are usable and correct.

Part B

The package I chose is cxxtools, a generic C++ library that is part of the C++ web application framework Tntnet.

In cxxtools, I could not find much assembly-language code, though I did find some inline assembly in the atomicity component. It selects the appropriate inline assembly version based on the platform, e.g. arm, x86, x86_64, MIPS, etc.

Considering it also has a platform-independent version (written in pure C++), I assume the inline assembly is there to improve performance wherever possible.
My personal opinion is that even though inline assembly is very complicated (even more so than pure assembly), its existence is necessary. After all, compiler optimization doesn't always produce the best result, and it sometimes breaks the software (looking at you, -Ofast).
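The platform-selection pattern described above can be sketched as follows. This is not the cxxtools source, and atomic_increment is a hypothetical name. It uses an x86 lock xadd instruction where available and otherwise falls back to the GCC/Clang __atomic_add_fetch builtin as a stand-in for a platform-independent version.

```c
#include <stdint.h>

/* Hypothetical example: atomically add 1 to *value and return the new value. */
static inline int32_t atomic_increment(volatile int32_t *value)
{
#if defined(__x86_64__) || defined(__i386__)
    int32_t old = 1;
    /* lock xadd adds the register to memory and leaves the previous
       memory value in the register, all as one atomic operation. */
    __asm__ volatile ("lock; xaddl %0, %1"
                      : "+r"(old), "+m"(*value)
                      :
                      : "memory", "cc");
    return old + 1;
#else
    /* Generic fallback using the compiler's atomic builtin. */
    return __atomic_add_fetch(value, 1, __ATOMIC_SEQ_CST);
#endif
}
```

A library written this way keeps one interface for callers while choosing the implementation at compile time through preprocessor checks.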

Reflection

Overall, inline assembly is an interesting (and sometimes necessary) piece of technology to work with, but personally I would rather rely on the compiler and build with -O2 or -O3 instead.


