[PATCH][AArch64] Adjust writeback in non-zero memset

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH][AArch64] Adjust writeback in non-zero memset

Wilco Dijkstra-2
This fixes an ineffiency in the non-zero memset.  Delaying the writeback
until the end of the loop is slightly faster on some cores - this shows
~5% performance gain on Cortex-A53 when doing large non-zero memsets.

Tested against the GLIBC testsuite, OK for commit?

ChangeLog:
2018-11-07  Wilco Dijkstra  <[hidden email]>

        * sysdeps/aarch64/memset.S (MEMSET): Improve non-zero memset loop.

---

diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 4a454593618f78e22c55520d56737fab5d8f63a4..2eefc62fc1eeccf736f627a7adfe5485aff9bca9 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -89,10 +89,10 @@ L(set_long):
  b.eq L(try_zva)
 L(no_zva):
  sub count, dstend, dst /* Count is 16 too large.  */
- add dst, dst, 16
+ sub dst, dst, 16 /* Dst is biased by -32.  */
  sub count, count, 64 + 16 /* Adjust count and bias for loop.  */
-1: stp q0, q0, [dst], 64
- stp q0, q0, [dst, -32]
+1: stp q0, q0, [dst, 32]
+ stp q0, q0, [dst, 64]!
 L(tail64):
  subs count, count, 64
  b.hi 1b
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH][AArch64] Adjust writeback in non-zero memset

Wilco Dijkstra-2
v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.

This fixes an ineffiency in the non-zero memset.  Delaying the writeback
until the end of the loop is slightly faster on some cores - this shows
~5% performance gain on Cortex-A53 when doing large non-zero memsets.

Tested against the GLIBC testsuite, OK for commit?

ChangeLog:
2018-11-14  Wilco Dijkstra  <[hidden email]>

        * sysdeps/aarch64/memset.S (MEMSET): Improve non-zero memset loop.

---

diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 4a454593618f78e22c55520d56737fab5d8f63a4..9738cf5fd55a1d937fb3392cec46f37b4d5fb51d 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -89,10 +89,10 @@ L(set_long):
     b.eq    L(try_zva)
 L(no_zva):
     sub    count, dstend, dst    /* Count is 16 too large.  */
-    add    dst, dst, 16
+    sub    dst, dst, 16        /* Dst is biased by -32.  */
     sub    count, count, 64 + 16    /* Adjust count and bias for loop.  */
-1:    stp    q0, q0, [dst], 64
-    stp    q0, q0, [dst, -32]
+1:    stp    q0, q0, [dst, 32]
+    stp    q0, q0, [dst, 64]!
 L(tail64):
     subs    count, count, 64
     b.hi    1b
@@ -183,6 +183,7 @@ L(zva_other):
     subs    count, count, zva_len
     b.hs    3b
 4:    add    count, count, zva_len
+    sub    dst, dst, 32        /* Bias dst for tail loop.  */
     b    L(tail64)
 #endif
   
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH][AArch64] Adjust writeback in non-zero memset

Siddhesh Poyarekar-8
On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
>
> This fixes an ineffiency in the non-zero memset.  Delaying the writeback
> until the end of the loop is slightly faster on some cores - this shows
> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>
> Tested against the GLIBC testsuite, OK for commit?

Can you please also summarize the performance results for other processors?

Thanks,
Siddhesh
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH][AArch64] Adjust writeback in non-zero memset

Wilco Dijkstra-2
Hi Siddhesh,

>On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
>> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
>>
>> This fixes an ineffiency in the non-zero memset.  Delaying the writeback
>> until the end of the loop is slightly faster on some cores - this shows
>> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>>
>> Tested against the GLIBC testsuite, OK for commit?
>
> Can you please also summarize the performance results for other processors?

I ran it on Cortex-A72, and there isn't a performance difference on the memset bench
beyond measurement errors. On any out-of-order core address increments are
executed in parallel with loads/stores, ie. they have no measurable latency and so
their exact placement is irrelevant.

Wilco
   
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH][AArch64] Adjust writeback in non-zero memset

Siddhesh Poyarekar-8
On 16/11/18 10:34 PM, Wilco Dijkstra wrote:
> I ran it on Cortex-A72, and there isn't a performance difference on the memset bench
> beyond measurement errors. On any out-of-order core address increments are
> executed in parallel with loads/stores, ie. they have no measurable latency and so
> their exact placement is irrelevant.

Thanks for the confirmation.  Looks OK to me.

Siddhesh