SPO 600 - Lab 6
In this lab, I explore inline assembler and how it is used in open source software.
Part A
For comparison, I tested the performance of vol_simd (the version with inline assembly and SIMD) and vol3 (the version with no inline assembly, which I built in lab 5) on aarch64, since the inline assembly version is written in aarch64 assembly. vol_simd was a bit faster than its pure-C counterpart: its reported "Time spent" was 0.000670 versus 0.001029, although the total run time was practically the same.
Inline assembly and SIMD:
[GeoffWu@aarchie spo600_20181_inline_assembler_lab]$ time ./vol_simd
Generating sample data.
Scaling samples.
Summing samples.
Result: -454
Time spent: 0.000670
real 0m0.027s
user 0m0.027s
sys 0m0.000s
Pure C:
[GeoffWu@aarchie spo600_20181_vol_skel]$ time ./vol1
Result: -142
Time spent: 0.001029
real 0m0.028s
user 0m0.028s
sys 0m0.000s
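I then increased the number of samples to 500000000, and vol_simd was still faster ("Time spent" of 0.826959 versus 1.068387), although the total run time was again nearly identical at about 25.9 seconds.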
Inline assembly and SIMD:
[GeoffWu@aarchie spo600_20181_inline_assembler_lab]$ make && time ./vol_simd
gcc -g -O3 vol_simd.c -o vol_simd
Generating sample data.
Scaling samples.
Summing samples.
Result: -751
Time spent: 0.826959
real 0m25.909s
user 0m24.816s
sys 0m1.079s
Pure C:
[GeoffWu@aarchie spo600_20181_vol_skel]$ time ./vol1
Result: -606
Time spent: 1.068387
real 0m25.942s
user 0m24.828s
sys 0m1.098s
//Q: what is an alternate approach?
register int16_t* in_cursor asm("r20"); // input cursor
register int16_t* out_cursor asm("r21"); // output cursor
register int16_t vol_int asm("r22"); // volume as int16_t
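A: Let the compiler choose which registers to use: declare ordinary variables and pass them to the asm block through its operand constraints instead of pinning each one to a named register.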
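To illustrate that answer, here is a minimal sketch of an asm block where the compiler picks the registers. It is not taken from vol_simd.c: the function add_with_asm and its operand names are made up for this example, and I'm assuming GCC or Clang extended asm syntax. It has an aarch64 branch, an x86 branch, and a plain C fallback so it builds on other machines too.

```c
#include <stdint.h>

/* Hypothetical helper (not from vol_simd.c): adds two 32-bit integers
   using inline assembly where a known instruction set is available.
   No variable is tied to a specific register; the "r" constraints let
   the compiler choose any general-purpose register. */
int32_t add_with_asm(int32_t a, int32_t b)
{
#if defined(__aarch64__)
    int32_t sum;
    /* %w[...] selects the 32-bit view of whichever register was chosen. */
    __asm__("add %w[sum], %w[a], %w[b]"
            : [sum] "=r"(sum)           /* output operand */
            : [a] "r"(a), [b] "r"(b));  /* input operands */
    return sum;
#elif defined(__x86_64__) || defined(__i386__)
    /* "+r" marks a as both read and written; add changes the flags. */
    __asm__("addl %[b], %[a]"
            : [a] "+r"(a)
            : [b] "r"(b)
            : "cc");
    return a;
#else
    return a + b; /* plain C fallback for other platforms */
#endif
}
```

Because the operands are referred to by name, the template keeps working no matter which registers the compiler ends up assigning.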
// set vol_int to fixed-point representation of 0.75
// Q: should we use 32767 or 32768 in next line? why?
vol_int = (int16_t) (0.75 * 32767.0);
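A: 32767, because the range of int16_t is -32,768 to 32,767. For 0.75 either multiplier would fit, but a volume factor of 1.0 multiplied by 32768 would overflow an int16_t.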
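The fixed-point arithmetic behind that answer can be sketched in plain C. This is only an illustration under my own assumptions: volume_to_fixed and scale_sample are hypothetical names, and the real scaling in vol_simd is done by the assembly code, which is not shown here.

```c
#include <stdint.h>

/* Convert a volume factor in [0.0, 1.0] to a 16-bit fixed-point value.
   Scaling by 32767 keeps a factor of 1.0 inside the int16_t range;
   1.0 * 32768 would not fit. */
int16_t volume_to_fixed(double factor)
{
    return (int16_t)(factor * 32767.0);
}

/* Scale one sample: widen to 32 bits, multiply, then divide out the
   15 fraction bits. Integer division truncates toward zero. */
int16_t scale_sample(int16_t sample, int16_t vol_int)
{
    int32_t product = (int32_t)sample * (int32_t)vol_int;
    return (int16_t)(product / 32768);
}
```

With these definitions, volume_to_fixed(0.75) gives 24575 (0.75 * 32767 = 24575.25, truncated), and scale_sample(1000, 24575) gives 749, slightly under the ideal 750 because of the truncation.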
// Q: what happens if we remove the following
// two lines? Why?
: [in]"+r"(in_cursor)
: "0"(in_cursor),[out]"r"(out_cursor)
);
A: It won't even compile. These two lines declare the asm block's output and input operands, so without them the compiler doesn't know which variables supply the input and which receive the result.
// Q: are the results usable? are they correct?
A: On the aarch64 platform, the results are usable and correct.
Part B
The package I chose is cxxtools, a generic C++ library that is part of the C++ web application framework Tntnet.
In cxxtools, I couldn't find much assembly-language code, though I did find some inline assembly in the atomicity component. It selects the appropriate inline assembly version based on the platform, e.g. ARM, x86, x86_64, MIPS, etc.
Since it also has a platform-independent version (written in pure C++), I assume the inline assembly is there to improve performance wherever possible.
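To show the general pattern of picking an implementation per platform, here is a minimal sketch in C. It is not the cxxtools code: atomic_increment is a hypothetical function, only an x86 branch is written in assembly, and the fallback uses a GCC/Clang atomic builtin rather than the pure C++ version the library ships.

```c
/* Hypothetical function, not the cxxtools API.
   Atomically adds 1 to *value and returns the new value. */
int atomic_increment(volatile int *value)
{
#if defined(__x86_64__) || defined(__i386__)
    int old = 1;
    /* lock xadd atomically adds old to *value and leaves the previous
       contents of *value in old. */
    __asm__ volatile("lock; xaddl %0, %1"
                     : "+r"(old), "+m"(*value)
                     :
                     : "memory", "cc");
    return old + 1;
#else
    /* Fallback for other platforms: compiler-provided atomic builtin. */
    return __atomic_add_fetch(value, 1, __ATOMIC_SEQ_CST);
#endif
}
```

The caller sees one function either way; the preprocessor decides at build time whether the hand-written instruction or the portable path is compiled in.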
My personal opinion is that even though inline assembly is very complicated (even more so than pure assembly), its existence is necessary. After all, compiler optimization doesn't always produce the best result, and it sometimes breaks the software (looking at you, -Ofast).