SPO 600 - Lab 4

In Lab 4, I will be looking at Single Instruction Multiple Data (SIMD) and auto-vectorization. Simply put, vectorization means performing one operation on multiple pairs of operands at the same time, which makes the program faster.
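
To make "one operation on multiple pairs of operands" concrete, here is a minimal sketch using the vector extension that GCC and Clang provide. It is a compiler extension, not standard C, and the names v4si and add4 are made up for illustration; this is not the lab code itself.

```c
/* GCC/Clang vector extension (an assumption about the compiler, not standard C):
   four 32-bit ints packed into one 16-byte value. */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical helper: adds four pairs of operands with one vector addition. */
v4si add4(v4si x, v4si y) {
    return x + y;   /* lane-wise: x[0]+y[0], x[1]+y[1], x[2]+y[2], x[3]+y[3] */
}
```

Calling add4 on {1, 2, 3, 4} and {10, 20, 30, 40} gives the four sums in one step, instead of looping over the pairs one at a time.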

For this lab, I have to write a C program that creates two 1000-element integer arrays, fills them with random numbers in the range -1000 to +1000, sums the two arrays element by element into a third array, calculates the sum of the third array, and then displays it. I will test the program in an AArch64 environment.

Here is the source code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(){
 const int SIZE = 1000;
 int a[SIZE], b[SIZE], c[SIZE];
 int min = -1000, max = 1000;

 srand(time(NULL));
 for(int i = 0; i < SIZE; i++){
  a[i] = rand()%(max+1-min);
  b[i] = rand()%(max+1-min);
 }
 int sum = 0;
 for(int j = 0; j < SIZE; j++){
  c[j] = a[j] + b[j];
  sum += c[j]; 
 }
 printf("Sum = %d\n", sum);
}
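
One thing to note about the listing above: rand()%(max+1-min) produces values from 0 to 2000, because min is never added back. The disassembly in this post was built from the code as posted, so it shows the remainder by 2001 being stored directly. A minimal sketch of a helper that does land in -1000 to +1000 could look like this (rand_in_range is a hypothetical name, and it assumes min <= max and a span no larger than RAND_MAX):

```c
#include <stdlib.h>

/* Hypothetical helper: pseudo-random int in [min, max].
   Assumes min <= max and (max - min) < RAND_MAX. */
int rand_in_range(int min, int max) {
    int span = max - min + 1;     /* 2001 for -1000..+1000 */
    return rand() % span + min;   /* shift 0..span-1 to min..max */
}
```

With this helper, the fill loop would assign a[i] = rand_in_range(min, max) and likewise for b[i].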


Then I compile the program using gcc with auto-vectorization enabled. The command is:

gcc -O3 -o lab4 lab4.c (Note: Vectorization is enabled by default at -O3)

Using objdump, I get the disassembly of the <main> section of my program, and I can see that auto-vectorization is doing its job. The vectorized instructions are the ones that operate on the q and v registers, and I added comments to make the listing easier to read:

0000000000400560 <main>:
//function prologue: save registers and reserve stack space for the three arrays
  400560:    a9bc7bfd     stp    x29, x30, [sp, #-64]!
  400564:    d2800000     mov    x0, #0x0                       // #0
  400568:    910003fd     mov    x29, sp
  40056c:    a90153f3     stp    x19, x20, [sp, #16]
  400570:    529a9c74     mov    w20, #0xd4e3                    // #54499
  400574:    a9025bf5     stp    x21, x22, [sp, #32]
  400578:    72a83014     movk    w20, #0x4180, lsl #16
  40057c:    a90363f7     stp    x23, x24, [sp, #48]
  400580:    d13ec3ff     sub    sp, sp, #0xfb0
  400584:    910003f5     mov    x21, sp
  400588:    d13ec3ff     sub    sp, sp, #0xfb0
  40058c:    910003f6     mov    x22, sp
  400590:    d13ec3ff     sub    sp, sp, #0xfb0
  400594:    d2800018     mov    x24, #0x0                       // #0
  400598:    910003f7     mov    x23, sp
  40059c:    5280fa33     mov    w19, #0x7d1                     // #2001
//calling the time, srand and rand subroutines to generate the random numbers
  4005a0:    97ffffd4     bl    4004f0 <time@plt>
  4005a4:    97ffffe7     bl    400540 <srand@plt>
  4005a8:    97ffffda     bl    400510 <rand@plt>
  4005ac:    9b347c01     smull    x1, w0, w20
  4005b0:    9369fc21     asr    x1, x1, #41
  4005b4:    4b807c21     sub    w1, w1, w0, asr #31
  4005b8:    1b138020     msub    w0, w1, w19, w0
  4005bc:    b8386aa0     str    w0, [x21, x24]
  4005c0:    97ffffd4     bl    400510 <rand@plt>
  4005c4:    9b347c01     smull    x1, w0, w20
  4005c8:    9369fc21     asr    x1, x1, #41
  4005cc:    4b807c21     sub    w1, w1, w0, asr #31
  4005d0:    1b138020     msub    w0, w1, w19, w0
  4005d4:    b8386ac0     str    w0, [x22, x24]
//advance the byte offset by 4 (one int)
  4005d8:    91001318     add    x24, x24, #0x4
//check whether the byte offset has reached 4000 (1000 ints * 4 bytes); if not, continue the loop
  4005dc:    f13e831f     cmp    x24, #0xfa0
  4005e0:    54fffe41     b.ne    4005a8 <main+0x48>  // b.any
  4005e4:    cb550be0     neg    x0, x21, lsr #2
  4005e8:    72000400     ands    w0, w0, #0x3
  4005ec:    54000a80     b.eq    40073c <main+0x1dc>  // b.none
  4005f0:    b94002a1     ldr    w1, [x21]
  4005f4:    7100041f     cmp    w0, #0x1
  4005f8:    b94002c8     ldr    w8, [x22]
  4005fc:    0b080028     add    w8, w1, w8
  400600:    b90002e8     str    w8, [x23]
  400604:    54000960     b.eq    400730 <main+0x1d0>  // b.none
  400608:    b94006a1     ldr    w1, [x21, #4]
  40060c:    71000c1f     cmp    w0, #0x3
  400610:    b94006c2     ldr    w2, [x22, #4]
  400614:    0b020021     add    w1, w1, w2
  400618:    b90006e1     str    w1, [x23, #4]
  40061c:    0b010108     add    w8, w8, w1
  400620:    54000961     b.ne    40074c <main+0x1ec>  // b.any
  400624:    b9400aa2     ldr    w2, [x21, #8]
  400628:    2a0003e6     mov    w6, w0
  40062c:    b9400ac1     ldr    w1, [x22, #8]
  400630:    52807ca9     mov    w9, #0x3e5                     // #997
  400634:    0b020021     add    w1, w1, w2
  400638:    b9000ae1     str    w1, [x23, #8]
  40063c:    0b010108     add    w8, w8, w1
  400640:    4f000401     movi    v1.4s, #0x0
  400644:    d37e0402     ubfiz    x2, x0, #2, #2
  400648:    52807d07     mov    w7, #0x3e8                     // #1000
  40064c:    4b0000e7     sub    w7, w7, w0
  400650:    8b0202a4     add    x4, x21, x2
  400654:    8b0202c3     add    x3, x22, x2
  400658:    53027ce5     lsr    w5, w7, #2
  40065c:    8b0202e2     add    x2, x23, x2
  400660:    d2800000     mov    x0, #0x0                       // #0
  400664:    52800001     mov    w1, #0x0                       // #0
  400668:    3ce06860     ldr    q0, [x3, x0]
  40066c:    11000421     add    w1, w1, #0x1
  400670:    3ce06882     ldr    q2, [x4, x0]
  400674:    6b0100bf     cmp    w5, w1
//add four elements of the two arrays into the 3rd, and accumulate the running sum
  400678:    4ea28400     add    v0.4s, v0.4s, v2.4s
  40067c:    4ea08421     add    v1.4s, v1.4s, v0.4s
  400680:    3ca06840     str    q0, [x2, x0]
  400684:    91004000     add    x0, x0, #0x10
  400688:    54ffff08     b.hi    400668 <main+0x108>  // b.pmore
  40068c:    4eb1b821     addv    s1, v1.4s
  400690:    121e74e1     and    w1, w7, #0xfffffffc
  400694:    0b060020     add    w0, w1, w6
  400698:    4b010123     sub    w3, w9, w1
  40069c:    6b0100ff     cmp    w7, w1
  4006a0:    0e043c21     mov    w1, v1.s[0]
  4006a4:    0b080021     add    w1, w1, w8
  4006a8:    54000300     b.eq    400708 <main+0x1a8>  // b.none
  4006ac:    93407c05     sxtw    x5, w0
  4006b0:    11000402     add    w2, w0, #0x1
  4006b4:    7100047f     cmp    w3, #0x1
  4006b8:    b8657aa4     ldr    w4, [x21, x5, lsl #2]
  4006bc:    b8657ac6     ldr    w6, [x22, x5, lsl #2]
  4006c0:    0b060084     add    w4, w4, w6
  4006c4:    b8257ae4     str    w4, [x23, x5, lsl #2]
  4006c8:    0b040021     add    w1, w1, w4
  4006cc:    540001e0     b.eq    400708 <main+0x1a8>  // b.none
  4006d0:    93407c42     sxtw    x2, w2
  4006d4:    7100087f     cmp    w3, #0x2
  4006d8:    11000800     add    w0, w0, #0x2
  4006dc:    b8627aa3     ldr    w3, [x21, x2, lsl #2]
  4006e0:    b8627ac4     ldr    w4, [x22, x2, lsl #2]
  4006e4:    0b040063     add    w3, w3, w4
  4006e8:    b8227ae3     str    w3, [x23, x2, lsl #2]
  4006ec:    0b030021     add    w1, w1, w3
  4006f0:    540000c0     b.eq    400708 <main+0x1a8>  // b.none
  4006f4:    93407c00     sxtw    x0, w0
  4006f8:    b8607aa2     ldr    w2, [x21, x0, lsl #2]
  4006fc:    b8607ac0     ldr    w0, [x22, x0, lsl #2]
  400700:    0b000040     add    w0, w2, w0
  400704:    0b000021     add    w1, w1, w0
  400708:    90000000     adrp    x0, 400000 <_init-0x4b8>
  40070c:    91246000     add    x0, x0, #0x918
//print the sum of the 3rd array
  400710:    97ffff90     bl    400550 <printf@plt>
  400714:    910003bf     mov    sp, x29
  400718:    52800000     mov    w0, #0x0                       // #0
  40071c:    a94153f3     ldp    x19, x20, [sp, #16]
  400720:    a9425bf5     ldp    x21, x22, [sp, #32]
  400724:    a94363f7     ldp    x23, x24, [sp, #48]
  400728:    a8c47bfd     ldp    x29, x30, [sp], #64
  40072c:    d65f03c0     ret
  400730:    52807ce9     mov    w9, #0x3e7                     // #999
  400734:    2a0003e6     mov    w6, w0
  400738:    17ffffc2     b    400640 <main+0xe0>
  40073c:    52807d09     mov    w9, #0x3e8                     // #1000
  400740:    52800008     mov    w8, #0x0                       // #0
  400744:    52800006     mov    w6, #0x0                       // #0
  400748:    17ffffbe     b    400640 <main+0xe0>
  40074c:    52807cc9     mov    w9, #0x3e6                     // #998
  400750:    52800046     mov    w6, #0x2                       // #2
  400754:    17ffffbb     b    400640 <main+0xe0>
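
A side note on the random-number part of the listing: there is no divide instruction for the % 2001. The compiler multiplies by the constant in w20 (0x4180d4e3) and shifts instead. Here is a rough C sketch of what the smull / asr / sub / msub sequence computes; rem2001 is a made-up name, and the sketch assumes that >> on a negative value is an arithmetic shift, as it is with GCC.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the listing: x % 2001 without a divide.
   Assumes >> on negative signed values is an arithmetic shift (true for GCC). */
int32_t rem2001(int32_t x) {
    const int64_t magic = 0x4180d4e3;                   /* value loaded into w20 */
    int32_t q = (int32_t)(((int64_t)x * magic) >> 41);  /* smull, then asr #41 */
    q -= x >> 31;                                       /* sub w1, w1, w0, asr #31 */
    return x - q * 2001;                                /* msub w0, w1, w19, w0 */
}
```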

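According to page 18 of the great ARMv8 Instruction Set Overview (yes, it is just an overview; considering it is already 112 pages long, I would imagine the actual guide is thicker than the Encyclopædia Britannica), my program is using three 128-bit SIMD vector registers, each viewed as four 32-bit lanes: v0.4s, v1.4s, and v2.4s. They do not hold the whole arrays. Each loop iteration loads four elements of each input array into v0 and v2, adds them into v0 (which is then stored to the third array), and accumulates the running sums in v1. After the loop, addv adds the four lanes of v1 into a single value. If you are wondering what "v1.s[0]" means, it simply refers to a single SIMD vector element (here, lane 0 of v1), according to the aforementioned Overview.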
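
The shape of the vector loop in the listing can be mimicked in plain, portable C. This is only a sketch to show the structure, not what the compiler actually emits: add_and_sum is a made-up name, the small lane array stands in for the v1 accumulator register, and the alignment step that the real code performs before the vector loop (handling up to three leading elements one at a time) is left out.

```c
#include <stddef.h>

/* Portable sketch of the vectorized loop: c[i] = a[i] + b[i], plus the total.
   lane[] plays the role of the accumulator register v1.4s. */
int add_and_sum(const int *a, const int *b, int *c, size_t n) {
    int lane[4] = {0, 0, 0, 0};              /* like movi v1.4s, #0 */
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; k++) {
            c[i + k] = a[i + k] + b[i + k];  /* like add v0.4s, v0.4s, v2.4s */
            lane[k] += c[i + k];             /* like add v1.4s, v1.4s, v0.4s */
        }
    }

    /* Combine the four lanes into one value, like addv s1, v1.4s. */
    int sum = lane[0] + lane[1] + lane[2] + lane[3];

    /* Scalar tail for the 0 to 3 leftover elements. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
        sum += c[i];
    }
    return sum;
}
```

The scalar tail here corresponds to the block of ordinary ldr / add / str instructions that follows the addv in the listing.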

Reflection

Overall, I would say it is an interesting lab. Writing the program itself was not hard (I finished it in 5 minutes or so), but inspecting the disassembly and proving that it was vectorized was not easy.
