SPO 600 - Lab 4

In Lab 4, I will be looking at Single Instruction, Multiple Data (SIMD) and auto-vectorization. Simply put, vectorization means applying one operation to multiple pairs of operands at the same time, with the goal of making the program faster.
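To make "one operation on multiple pairs of operands" concrete, here is a minimal sketch that compares a plain scalar loop with a hand-written four-lane version. It assumes GCC or Clang, because it relies on their `vector_size` extension, and it assumes a 32-bit `int`. The names `v4si`, `add_scalar`, and `add_vec4` are made up for illustration and are not part of the lab program.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang extension (an assumption of this sketch): a 16-byte value
   holding four 32-bit ints. */
typedef int v4si __attribute__((vector_size(16)));

/* Scalar version: one addition per loop iteration. */
void add_scalar(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Vector version: four additions per loop iteration.
   For simplicity, n must be a multiple of 4. */
void add_vec4(const int *a, const int *b, int *c, int n) {
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        v4si va, vb, vc;
        memcpy(&va, a + i, sizeof va);   /* load four elements of a */
        memcpy(&vb, b + i, sizeof vb);   /* load four elements of b */
        vc = va + vb;                    /* one operation, four pairs of operands */
        memcpy(c + i, &vc, sizeof vc);   /* store four results */
    }
}
```

Both functions should produce the same output array. With auto-vectorization, the compiler tries to turn a loop like `add_scalar` into something like `add_vec4` on its own.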

For this lab, I have to write a C program that creates two 1000-element integer arrays, fills them with random numbers in the range -1000 to +1000, sums the two arrays element by element into a third array, calculates the sum of the third array, and then displays it. I will test the program in an AArch64 environment.

Here is the source code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

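int main(){
 const int SIZE = 1000;
 int a[SIZE], b[SIZE], c[SIZE];
 int min = -1000, max = 1000;

 // seed the generator, then fill a and b with random numbers
 srand(time(NULL));
 for(int i = 0; i < SIZE; i++){
  // rand()%(max+1-min) is rand()%2001, i.e. 0..2000; adding min would shift it to -1000..+1000
  a[i] = rand()%(max+1-min);
  b[i] = rand()%(max+1-min);
 }

 // add element by element into c and keep a running total
 int sum = 0;
 for(int j = 0; j < SIZE; j++){
  c[j] = a[j] + b[j];
  sum += c[j];
 }
 printf("Sum = %d\n", sum);
 return 0;
}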
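One thing to watch in the listing: the expression `rand()%(max+1-min)` evaluates to `rand()%2001`, which yields values from 0 to 2000 rather than -1000 to +1000. A minimal sketch of a helper that shifts the result into the intended range could look like this. The name `random_in_range` is hypothetical and is not part of the program that was compiled and disassembled for this lab.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns a pseudo-random value in [min, max].
   Assumes max - min + 1 fits in an int and does not exceed RAND_MAX. */
int random_in_range(int min, int max) {
    assert(min <= max);
    int span = max - min + 1;     /* 2001 for the range -1000..+1000 */
    return rand() % span + min;   /* rand() % span is 0..span-1; + min shifts it */
}
```

With this helper, the fill loop would read `a[i] = random_in_range(min, max);`. The modulo approach has a slight bias toward smaller values, which should not matter for this lab.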

Then I compile the program using gcc with auto-vectorization enabled. The command is:

gcc -O3 -o lab4 lab4.c

(Note: vectorization is enabled by default at -O3.)

Using objdump, I got the disassembly of the main function of my program, and I can see that auto-vectorization is doing its job. The vector instructions are the ones that use the v0, v1, and v2 registers (q0 and q2 in the loads and stores), between addresses 400640 and 4006a0:

0000000000400560 <main>:
//function prologue: save registers and reserve stack space for the three arrays (0xfb0 = 4016 bytes each)
400560:    a9bc7bfd     stp    x29, x30, [sp, #-64]!
400564:    d2800000     mov    x0, #0x0                       // #0
400568:    910003fd     mov    x29, sp
40056c:    a90153f3     stp    x19, x20, [sp, #16]
400570:    529a9c74     mov    w20, #0xd4e3                    // #54499
400574:    a9025bf5     stp    x21, x22, [sp, #32]
400578:    72a83014     movk    w20, #0x4180, lsl #16
40057c:    a90363f7     stp    x23, x24, [sp, #48]
400580:    d13ec3ff     sub    sp, sp, #0xfb0
400584:    910003f5     mov    x21, sp
400588:    d13ec3ff     sub    sp, sp, #0xfb0
40058c:    910003f6     mov    x22, sp
400590:    d13ec3ff     sub    sp, sp, #0xfb0
400594:    d2800018     mov    x24, #0x0                       // #0
400598:    910003f7     mov    x23, sp
40059c:    5280fa33     mov    w19, #0x7d1                     // #2001
//calling the time, srand and rand subroutines to generate the random numbers
4005a0:    97ffffd4     bl    4004f0 <time@plt>
4005a4:    97ffffe7     bl    400540 <srand@plt>
4005a8:    97ffffda     bl    400510 <rand@plt>
4005ac:    9b347c01     smull    x1, w0, w20
4005b0:    9369fc21     asr    x1, x1, #41
4005b4:    4b807c21     sub    w1, w1, w0, asr #31
4005b8:    1b138020     msub    w0, w1, w19, w0
4005bc:    b8386aa0     str    w0, [x21, x24]
4005c0:    97ffffd4     bl    400510 <rand@plt>
4005c4:    9b347c01     smull    x1, w0, w20
4005c8:    9369fc21     asr    x1, x1, #41
4005cc:    4b807c21     sub    w1, w1, w0, asr #31
4005d0:    1b138020     msub    w0, w1, w19, w0
4005d4:    b8386ac0     str    w0, [x22, x24]
//advance the byte offset by 4 (the size of one int)
4005d8:    91001318     add    x24, x24, #0x4
//check whether the byte offset has reached 0xfa0 = 4000 (1000 elements x 4 bytes); if not, continue
4005dc:    f13e831f     cmp    x24, #0xfa0
4005e0:    54fffe41     b.ne    4005a8 <main+0x48> // b.any
4005e4:    cb550be0     neg    x0, x21, lsr #2
4005e8:    72000400     ands    w0, w0, #0x3
4005ec:    54000a80     b.eq    40073c <main+0x1dc> // b.none
4005f0:    b94002a1     ldr    w1, [x21]
4005f4:    7100041f     cmp    w0, #0x1
4005f8:    b94002c8     ldr    w8, [x22]
4005fc:    0b080028     add    w8, w1, w8
400600:    b90002e8     str    w8, [x23]
400604:    54000960     b.eq    400730 <main+0x1d0> // b.none
400608:    b94006a1     ldr    w1, [x21, #4]
40060c:    71000c1f     cmp    w0, #0x3
400610:    b94006c2     ldr    w2, [x22, #4]
400614:    0b020021     add    w1, w1, w2
400618:    b90006e1     str    w1, [x23, #4]
40061c:    0b010108     add    w8, w8, w1
400620:    54000961     b.ne    40074c <main+0x1ec> // b.any
400624:    b9400aa2     ldr    w2, [x21, #8]
400628:    2a0003e6     mov    w6, w0
40062c:    b9400ac1     ldr    w1, [x22, #8]
400630:    52807ca9     mov    w9, #0x3e5                     // #997
400634:    0b020021     add    w1, w1, w2
400638:    b9000ae1     str    w1, [x23, #8]
40063c:    0b010108     add    w8, w8, w1
400640:    4f000401     movi    v1.4s, #0x0
400644:    d37e0402     ubfiz    x2, x0, #2, #2
400648:    52807d07     mov    w7, #0x3e8                     // #1000
40064c:    4b0000e7     sub    w7, w7, w0
400650:    8b0202a4     add    x4, x21, x2
400654:    8b0202c3     add    x3, x22, x2
400658:    53027ce5     lsr    w5, w7, #2
40065c:    8b0202e2     add    x2, x23, x2
400660:    d2800000     mov    x0, #0x0                       // #0
400664:    52800001     mov    w1, #0x0                       // #0
400668:    3ce06860     ldr    q0, [x3, x0]
40066c:    11000421     add    w1, w1, #0x1
400670:    3ce06882     ldr    q2, [x4, x0]
400674:    6b0100bf     cmp    w5, w1
//vector loop (400668 to 400688): add four pairs of elements at once, store them to the 3rd array, and accumulate the partial sums in v1
400678:    4ea28400     add    v0.4s, v0.4s, v2.4s
40067c:    4ea08421     add    v1.4s, v1.4s, v0.4s
400680:    3ca06840     str    q0, [x2, x0]
400684:    91004000     add    x0, x0, #0x10
400688:    54ffff08     b.hi    400668 <main+0x108> // b.pmore
40068c:    4eb1b821     addv    s1, v1.4s
400690:    121e74e1     and    w1, w7, #0xfffffffc
400694:    0b060020     add    w0, w1, w6
400698:    4b010123     sub    w3, w9, w1
40069c:    6b0100ff     cmp    w7, w1
4006a0:    0e043c21     mov    w1, v1.s[0]
4006a4:    0b080021     add    w1, w1, w8
4006a8:    54000300     b.eq    400708 <main+0x1a8> // b.none
4006ac:    93407c05     sxtw    x5, w0
4006b0:    11000402     add    w2, w0, #0x1
4006b4:    7100047f     cmp    w3, #0x1
4006b8:    b8657aa4     ldr    w4, [x21, x5, lsl #2]
4006bc:    b8657ac6     ldr    w6, [x22, x5, lsl #2]
4006c0:    0b060084     add    w4, w4, w6
4006c4:    b8257ae4     str    w4, [x23, x5, lsl #2]
4006c8:    0b040021     add    w1, w1, w4
4006cc:    540001e0     b.eq    400708 <main+0x1a8> // b.none
4006d0:    93407c42     sxtw    x2, w2
4006d4:    7100087f     cmp    w3, #0x2
4006d8:    11000800     add    w0, w0, #0x2
4006dc:    b8627aa3     ldr    w3, [x21, x2, lsl #2]
4006e0:    b8627ac4     ldr    w4, [x22, x2, lsl #2]
4006e4:    0b040063     add    w3, w3, w4
4006e8:    b8227ae3     str    w3, [x23, x2, lsl #2]
4006ec:    0b030021     add    w1, w1, w3
4006f0:    540000c0     b.eq    400708 <main+0x1a8> // b.none
4006f4:    93407c00     sxtw    x0, w0
4006f8:    b8607aa2     ldr    w2, [x21, x0, lsl #2]
4006fc:    b8607ac0     ldr    w0, [x22, x0, lsl #2]
400700:    0b000040     add    w0, w2, w0
400704:    0b000021     add    w1, w1, w0
400708:    90000000     adrp    x0, 400000 <_init-0x4b8>
40070c:    91246000     add    x0, x0, #0x918
//printing the sum of 3rd array
400710:    97ffff90     bl    400550 <printf@plt>
400714:    910003bf     mov    sp, x29
400718:    52800000     mov    w0, #0x0                       // #0
40071c:    a94153f3     ldp    x19, x20, [sp, #16]
400720:    a9425bf5     ldp    x21, x22, [sp, #32]
400724:    a94363f7     ldp    x23, x24, [sp, #48]
400728:    a8c47bfd     ldp    x29, x30, [sp], #64
40072c:    d65f03c0     ret
400730:    52807ce9     mov    w9, #0x3e7                     // #999
400734:    2a0003e6     mov    w6, w0
400738:    17ffffc2     b    400640 <main+0xe0>
40073c:    52807d09     mov    w9, #0x3e8                     // #1000
400740:    52800008     mov    w8, #0x0                       // #0
400744:    52800006     mov    w6, #0x0                       // #0
400748:    17ffffbe     b    400640 <main+0xe0>
40074c:    52807cc9     mov    w9, #0x3e6                     // #998
400750:    52800046     mov    w6, #0x2                       // #2
400754:    17ffffbb     b    400640 <main+0xe0>

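According to page 18 of the great ARMv8 Instruction Set Overview (yes, it is just an overview; considering that it is already 112 pages long, I would imagine the actual guide is thicker than the Encyclopædia Britannica), my program is using three 128-bit SIMD vector registers, each treated as four 32-bit lanes: v0.4s, v1.4s, and v2.4s. These registers do not hold the whole arrays. On each pass through the loop, q0 and q2 are loaded with four elements of the second and first array, the first add writes the four sums into v0 (which is then stored to the third array), and the second add accumulates those sums into v1. After the loop, addv adds the four lanes of v1 together into a single value. If you are wondering what "v1.s[0]" means, it simply refers to a single element of a SIMD vector (here lane 0 of v1, which is copied into w1), according to the aforementioned Overview.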
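The overall shape of the generated code can be modeled in plain C. This is only a sketch of my reading of the disassembly, not what the compiler literally produces: a few scalar iterations until the first array is 16-byte aligned, a main loop that handles four elements per pass with a four-lane accumulator, a reduction of the four lanes (the job of addv), and a scalar tail for whatever is left. The function name `add_and_sum_model` and the `acc` array are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the vectorized loop; all names are illustrative. */
int add_and_sum_model(const int *a, const int *b, int *c, size_t n) {
    assert(a && b && c);
    size_t i = 0;
    int sum = 0;

    /* 1. Peel: scalar iterations until a + i sits on a 16-byte boundary
          (0 to 3 iterations for 4-byte ints). */
    while (i < n && ((uintptr_t)(a + i) % 16) != 0) {
        c[i] = a[i] + b[i];
        sum += c[i];
        i++;
    }

    /* 2. Main loop: four lanes per pass. acc plays the role of v1.4s. */
    int acc[4] = {0, 0, 0, 0};
    for (; i + 4 <= n; i += 4) {
        for (int lane = 0; lane < 4; lane++) {
            int s = a[i + lane] + b[i + lane];  /* add v0.4s, v0.4s, v2.4s */
            c[i + lane] = s;                    /* str q0 */
            acc[lane] += s;                     /* add v1.4s, v1.4s, v0.4s */
        }
    }

    /* 3. Horizontal reduction of the four lanes, like addv s1, v1.4s. */
    sum += acc[0] + acc[1] + acc[2] + acc[3];

    /* 4. Tail: up to three leftover elements, handled one at a time. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
        sum += c[i];
    }
    return sum;
}
```

For small inputs like the ones in this lab, the model returns the same total as the simple loop in the source code. The total is just collected in a different order, which is fine for integer addition.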

Reflection

Overall, I would say it was an interesting lab. Writing the program itself was not hard (I finished it in 5 minutes or so), but inspecting the disassembly and proving that the code was vectorized was far from easy.
