SPO 600 - Lab 4
In Lab 4, I will be looking at Single Instruction Multiple Data (SIMD) and auto-vectorization. Simply put, vectorization means applying one operation to several pairs of operands at the same time, which makes the program faster.
For this lab, I have to write a C program that creates two 1000-element integer arrays, fills them with random numbers in the range -1000 to +1000, sums the two arrays element by element into a third array, calculates the sum of the third array, and then displays it. I will test the program in an AArch64 environment.
Here is the source code:
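#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(){
    const int SIZE = 1000;
    int a[SIZE], b[SIZE], c[SIZE];
    int min = -1000, max = 1000;

    srand(time(NULL));

    // Fill both input arrays with random numbers.
    // Note: rand() % (max + 1 - min) is rand() % 2001, which yields 0 to 2000.
    // Adding min to the result would shift it to -1000 to +1000; the code is
    // left as compiled so that it matches the disassembly in this post.
    for(int i = 0; i < SIZE; i++){
        a[i] = rand()%(max+1-min);
        b[i] = rand()%(max+1-min);
    }

    // Add the arrays element by element and accumulate the total.
    int sum = 0;
    for(int j = 0; j < SIZE; j++){
        c[j] = a[j] + b[j];
        sum += c[j];
    }

    printf("Sum = %d\n", sum);
}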
Then I compile the program using gcc with auto-vectorization enabled. The command is:
gcc -O3 -o lab4 lab4.c (Note: Vectorization is enabled by default at -O3)
Using objdump, I get the disassembly of the <main> function of my program, and I can see that auto-vectorization is doing its job (the vector instructions are the ones that use the q and v registers):
0000000000400560 <main>:
//function prologue: save registers and reserve stack space for the three arrays
400560: a9bc7bfd stp x29, x30, [sp, #-64]!
400564: d2800000 mov x0, #0x0 // #0
400568: 910003fd mov x29, sp
40056c: a90153f3 stp x19, x20, [sp, #16]
400570: 529a9c74 mov w20, #0xd4e3 // #54499
400574: a9025bf5 stp x21, x22, [sp, #32]
400578: 72a83014 movk w20, #0x4180, lsl #16
40057c: a90363f7 stp x23, x24, [sp, #48]
400580: d13ec3ff sub sp, sp, #0xfb0
400584: 910003f5 mov x21, sp
400588: d13ec3ff sub sp, sp, #0xfb0
40058c: 910003f6 mov x22, sp
400590: d13ec3ff sub sp, sp, #0xfb0
400594: d2800018 mov x24, #0x0 // #0
400598: 910003f7 mov x23, sp
40059c: 5280fa33 mov w19, #0x7d1 // #2001
//calling the time, srand and rand subroutines to generate the random numbers
4005a0: 97ffffd4 bl 4004f0 <time@plt>
4005a4: 97ffffe7 bl 400540 <srand@plt>
4005a8: 97ffffda bl 400510 <rand@plt>
4005ac: 9b347c01 smull x1, w0, w20
4005b0: 9369fc21 asr x1, x1, #41
4005b4: 4b807c21 sub w1, w1, w0, asr #31
4005b8: 1b138020 msub w0, w1, w19, w0
4005bc: b8386aa0 str w0, [x21, x24]
4005c0: 97ffffd4 bl 400510 <rand@plt>
4005c4: 9b347c01 smull x1, w0, w20
4005c8: 9369fc21 asr x1, x1, #41
4005cc: 4b807c21 sub w1, w1, w0, asr #31
4005d0: 1b138020 msub w0, w1, w19, w0
4005d4: b8386ac0 str w0, [x22, x24]
//increment the loop counter
4005d8: 91001318 add x24, x24, #0x4
//check if the byte offset has reached 4000 (1000 elements * 4 bytes); if not, loop again
4005dc: f13e831f cmp x24, #0xfa0
4005e0: 54fffe41 b.ne 4005a8 <main+0x48> // b.any
4005e4: cb550be0 neg x0, x21, lsr #2
4005e8: 72000400 ands w0, w0, #0x3
4005ec: 54000a80 b.eq 40073c <main+0x1dc> // b.none
4005f0: b94002a1 ldr w1, [x21]
4005f4: 7100041f cmp w0, #0x1
4005f8: b94002c8 ldr w8, [x22]
4005fc: 0b080028 add w8, w1, w8
400600: b90002e8 str w8, [x23]
400604: 54000960 b.eq 400730 <main+0x1d0> // b.none
400608: b94006a1 ldr w1, [x21, #4]
40060c: 71000c1f cmp w0, #0x3
400610: b94006c2 ldr w2, [x22, #4]
400614: 0b020021 add w1, w1, w2
400618: b90006e1 str w1, [x23, #4]
40061c: 0b010108 add w8, w8, w1
400620: 54000961 b.ne 40074c <main+0x1ec> // b.any
400624: b9400aa2 ldr w2, [x21, #8]
400628: 2a0003e6 mov w6, w0
40062c: b9400ac1 ldr w1, [x22, #8]
400630: 52807ca9 mov w9, #0x3e5 // #997
400634: 0b020021 add w1, w1, w2
400638: b9000ae1 str w1, [x23, #8]
40063c: 0b010108 add w8, w8, w1
400640: 4f000401 movi v1.4s, #0x0
400644: d37e0402 ubfiz x2, x0, #2, #2
400648: 52807d07 mov w7, #0x3e8 // #1000
40064c: 4b0000e7 sub w7, w7, w0
400650: 8b0202a4 add x4, x21, x2
400654: 8b0202c3 add x3, x22, x2
400658: 53027ce5 lsr w5, w7, #2
40065c: 8b0202e2 add x2, x23, x2
400660: d2800000 mov x0, #0x0 // #0
400664: 52800001 mov w1, #0x0 // #0
400668: 3ce06860 ldr q0, [x3, x0]
40066c: 11000421 add w1, w1, #0x1
400670: 3ce06882 ldr q2, [x4, x0]
400674: 6b0100bf cmp w5, w1
//add the two arrays to the 3rd, and calculate the sum
400678: 4ea28400 add v0.4s, v0.4s, v2.4s
40067c: 4ea08421 add v1.4s, v1.4s, v0.4s
400680: 3ca06840 str q0, [x2, x0]
400684: 91004000 add x0, x0, #0x10
400688: 54ffff08 b.hi 400668 <main+0x108> // b.pmore
40068c: 4eb1b821 addv s1, v1.4s
400690: 121e74e1 and w1, w7, #0xfffffffc
400694: 0b060020 add w0, w1, w6
400698: 4b010123 sub w3, w9, w1
40069c: 6b0100ff cmp w7, w1
4006a0: 0e043c21 mov w1, v1.s[0]
4006a4: 0b080021 add w1, w1, w8
4006a8: 54000300 b.eq 400708 <main+0x1a8> // b.none
4006ac: 93407c05 sxtw x5, w0
4006b0: 11000402 add w2, w0, #0x1
4006b4: 7100047f cmp w3, #0x1
4006b8: b8657aa4 ldr w4, [x21, x5, lsl #2]
4006bc: b8657ac6 ldr w6, [x22, x5, lsl #2]
4006c0: 0b060084 add w4, w4, w6
4006c4: b8257ae4 str w4, [x23, x5, lsl #2]
4006c8: 0b040021 add w1, w1, w4
4006cc: 540001e0 b.eq 400708 <main+0x1a8> // b.none
4006d0: 93407c42 sxtw x2, w2
4006d4: 7100087f cmp w3, #0x2
4006d8: 11000800 add w0, w0, #0x2
4006dc: b8627aa3 ldr w3, [x21, x2, lsl #2]
4006e0: b8627ac4 ldr w4, [x22, x2, lsl #2]
4006e4: 0b040063 add w3, w3, w4
4006e8: b8227ae3 str w3, [x23, x2, lsl #2]
4006ec: 0b030021 add w1, w1, w3
4006f0: 540000c0 b.eq 400708 <main+0x1a8> // b.none
4006f4: 93407c00 sxtw x0, w0
4006f8: b8607aa2 ldr w2, [x21, x0, lsl #2]
4006fc: b8607ac0 ldr w0, [x22, x0, lsl #2]
400700: 0b000040 add w0, w2, w0
400704: 0b000021 add w1, w1, w0
400708: 90000000 adrp x0, 400000 <_init-0x4b8>
40070c: 91246000 add x0, x0, #0x918
//printing the sum of 3rd array
400710: 97ffff90 bl 400550 <printf@plt>
400714: 910003bf mov sp, x29
400718: 52800000 mov w0, #0x0 // #0
40071c: a94153f3 ldp x19, x20, [sp, #16]
400720: a9425bf5 ldp x21, x22, [sp, #32]
400724: a94363f7 ldp x23, x24, [sp, #48]
400728: a8c47bfd ldp x29, x30, [sp], #64
40072c: d65f03c0 ret
400730: 52807ce9 mov w9, #0x3e7 // #999
400734: 2a0003e6 mov w6, w0
400738: 17ffffc2 b 400640 <main+0xe0>
40073c: 52807d09 mov w9, #0x3e8 // #1000
400740: 52800008 mov w8, #0x0 // #0
400744: 52800006 mov w6, #0x0 // #0
400748: 17ffffbe b 400640 <main+0xe0>
40074c: 52807cc9 mov w9, #0x3e6 // #998
400750: 52800046 mov w6, #0x2 // #2
400754: 17ffffbb b 400640 <main+0xe0>
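According to page 18 of the great ARMv8 Instruction Set Overview (yes, it is just an overview; considering that it is already 112 pages long, I would imagine the actual guide is thicker than the Encyclopædia Britannica), my program is using three 128-bit SIMD vector registers, each treated as four 32-bit lanes: v0.4s, v1.4s, and v2.4s. The registers do not hold whole arrays. On each pass through the loop, v0 and v2 are loaded (under the names q0 and q2) with four elements from each of the two input arrays, "add v0.4s, v0.4s, v2.4s" produces four elements of the third array, "str q0" writes them back to memory, and v1 keeps four running partial sums. After the loop, "addv s1, v1.4s" adds the four lanes of v1 into a single value. If you are wondering what "v1.s[0]" means, it simply refers to a single element of the SIMD vector (lane 0), according to the aforementioned Overview; here it is copied into w1 to become the scalar sum.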