Re: [PATCH][AArch64] Optimized memcpy/memmove


Re: [PATCH][AArch64] Optimized memcpy/memmove

Wilco Dijkstra-2


-----Original Message-----
From: Wilco Dijkstra [mailto:[hidden email]]
Sent: 25 September 2015 14:17
To: 'GNU C Library'
Subject: [PATCH][AArch64] Optimized memcpy/memmove

Further optimize memcpy/memmove for AArch64. Copies are split into three main cases: small copies of up to 16 bytes; medium copies of 17..96 bytes, which are fully unrolled; and large copies of more than 96 bytes, which align the destination and use an unrolled loop processing 64 bytes per iteration. In order to share code with memmove, small and medium copies read all data before writing, allowing any kind of overlap. All memmoves except for the large backwards case fall into memcpy for optimal performance. On a random copy test, memcpy/memmove are 40% faster on A57 and 28% faster on A53.
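
In C terms, the dispatch is roughly the sketch below. This is illustrative only: the function name is made up, the library memcpy calls stand in for register loads/stores, and the patch itself is hand-written AArch64 assembly.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative C sketch of the dispatch described above, not the
   patch itself (which is hand-written AArch64 assembly).  Small and
   medium copies read all data before writing, which is what lets
   memmove share these paths for overlapping buffers.  */
void *
sketch_memcpy (void *dstv, const void *srcv, size_t n)
{
  char *dst = dstv;
  const char *src = srcv;

  if (n <= 16)
    {
      /* Small: 0..16 bytes.  Load both (possibly overlapping) halves
         before storing anything.  */
      if (n >= 8)
        {
          uint64_t a, b;
          memcpy (&a, src, 8);
          memcpy (&b, src + n - 8, 8);
          memcpy (dst, &a, 8);
          memcpy (dst + n - 8, &b, 8);
        }
      else
        {
          char tmp[8];
          memcpy (tmp, src, n);
          memcpy (dst, tmp, n);
        }
    }
  else if (n <= 96)
    {
      /* Medium: 17..96 bytes, fully unrolled in the assembly.
         Reading everything into temporaries first keeps this path
         overlap-safe.  */
      char tmp[96];
      memcpy (tmp, src, n);
      memcpy (dst, tmp, n);
    }
  else
    {
      /* Large: more than 96 bytes.  Copy the first 16 bytes
         unaligned, round the destination up to a 16-byte boundary,
         then copy 64 bytes per iteration; memcpy's contract rules
         out overlap on this path.  */
      memcpy (dst, src, 16);
      size_t skew = 16 - ((uintptr_t) dst & 15);
      dst += skew;
      src += skew;
      n -= skew;
      while (n >= 64)
        {
          memcpy (dst, src, 64);
          dst += 64;
          src += 64;
          n -= 64;
        }
      memcpy (dst, src, n);     /* 0..63 byte tail */
    }
  return dstv;
}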

OK for commit?

ChangeLog:
2015-09-25  Wilco Dijkstra  <[hidden email]>

        * sysdeps/aarch64/memcpy.S (memcpy):
        Rewrite of optimized memcpy and memmove.
        * sysdeps/aarch64/memmove.S (memmove): Remove
        memmove code (merged into memcpy.S).

Attachment: 0001-Optimized-memcpy.txt (24K)

Re: [PATCH][AArch64] Optimized memcpy/memmove

Wilco Dijkstra-2
ping

________________________________________
From: Wilco Dijkstra
Sent: 15 December 2015 16:40
To: 'GNU C Library'
Cc: nd
Subject: Re: [PATCH][AArch64] Optimized memcpy/memmove

-----Original Message-----
From: Wilco Dijkstra [mailto:[hidden email]]
Sent: 25 September 2015 14:17
To: 'GNU C Library'
Subject: [PATCH][AArch64] Optimized memcpy/memmove

Further optimize memcpy/memmove for AArch64. Copies are split into three main cases: small copies of up to 16 bytes; medium copies of 17..96 bytes, which are fully unrolled; and large copies of more than 96 bytes, which align the destination and use an unrolled loop processing 64 bytes per iteration. In order to share code with memmove, small and medium copies read all data before writing, allowing any kind of overlap. All memmoves except for the large backwards case fall into memcpy for optimal performance. On a random copy test, memcpy/memmove are 40% faster on A57 and 28% faster on A53.

OK for commit?

ChangeLog:
2015-09-25  Wilco Dijkstra  <[hidden email]>

        * sysdeps/aarch64/memcpy.S (memcpy):
        Rewrite of optimized memcpy and memmove.
        * sysdeps/aarch64/memmove.S (memmove): Remove
        memmove code (merged into memcpy.S).

Attachment: 0001-Optimized-memcpy.txt (24K)

Re: [PATCH][AArch64] Optimized memcpy/memmove

Wilco Dijkstra-2
In reply to this post by Wilco Dijkstra-2

ping


-----Original Message-----
From: Wilco Dijkstra [mailto:[hidden email]]
Sent: 25 September 2015 14:17
To: 'GNU C Library'
Subject: [PATCH][AArch64] Optimized memcpy/memmove

Further optimize memcpy/memmove for AArch64. Copies are split into three main cases: small copies of up to 16 bytes; medium copies of 17..96 bytes, which are fully unrolled; and large copies of more than 96 bytes, which align the destination and use an unrolled loop processing 64 bytes per iteration. In order to share code with memmove, small and medium copies read all data before writing, allowing any kind of overlap. All memmoves except for the large backwards case fall into memcpy for optimal performance. On a random copy test, memcpy/memmove are 40% faster on A57 and 28% faster on A53.

OK for commit?

ChangeLog:
2015-09-25  Wilco Dijkstra  <[hidden email]>

        * sysdeps/aarch64/memcpy.S (memcpy):
        Rewrite of optimized memcpy and memmove.
        * sysdeps/aarch64/memmove.S (memmove): Remove
        memmove code (merged into memcpy.S).

Attachment: 0001-Optimized-memcpy.txt (24K)

Re: [PATCH][AArch64] Optimized memcpy/memmove

Marcus
In reply to this post by Wilco Dijkstra-2
On 15 December 2015 at 16:40, Wilco Dijkstra <[hidden email]> wrote:

>
>
> -----Original Message-----
> From: Wilco Dijkstra [mailto:[hidden email]]
> Sent: 25 September 2015 14:17
> To: 'GNU C Library'
> Subject: [PATCH][AArch64] Optimized memcpy/memmove
>
> Further optimize memcpy/memmove for AArch64. Copies are split into 3 main cases: small copies of up to 16 bytes, medium copies of 17..96 bytes which are fully unrolled. Large copies of more than 96 bytes align the destination and use an unrolled loop processing 64 bytes per iteration. In order to share code with memmove, small and medium copies read all data before writing, allowing any kind of overlap. All memmoves except for the large backwards case fall into memcpy for optimal performance. On a random copy test memcpy/memmove are 40% faster on A57 and 28% on A53.
>
> OK for commit?
>
> ChangeLog:
> 2015-09-25  Wilco Dijkstra  <[hidden email]>
>
>         * sysdeps/aarch64/memcpy.S (memcpy):
>         Rewrite of optimized memcpy and memmove.
>         * sysdeps/aarch64/memmove.S (memmove): Remove
>         memmove code (merged into memcpy.S).

Hi, there appear to be odd tab characters inserted throughout the
comments, for example:

-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU

and

+   boundaries on both loads and stores. There are at least 96 bytes
+   to copy, so copy 16 bytes unaligned and then align. The loop

Please fix (and check for other instances) and post a respin.

Thanks
/Marcus

Re: [PATCH][AArch64] Optimized memcpy/memmove

Wilco Dijkstra-2

Marcus Shawcroft wrote:
> Hi, There appear to be odd tab characters inserted throughout the
> comments,  for example:

Hmm, it appears unexpand is buggy, so you end up having to do tabs
all by hand... Attached updated version.

Wilco

---
This is an optimized memcpy/memmove for AArch64. Copies are split into three main cases: small copies of up to 16 bytes; medium copies of 17..96 bytes, which are fully unrolled; and large copies of more than 96 bytes, which align the destination and use an unrolled loop processing 64 bytes per iteration. In order to share code with memmove, small and medium copies read all data before writing, allowing any kind of overlap. All memmoves except for the large backwards case fall into memcpy for optimal performance. On a random copy test, memcpy/memmove are 40.8% faster on A57 and 28.4% faster on A53.
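
To illustrate how memmove shares this code, a hypothetical C sketch follows; it reuses sketch_memcpy from the sketch earlier in the thread, and the names and exact overlap test are invented (the patch itself implements memmove in memcpy.S, as the ChangeLog below notes).

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *sketch_memcpy (void *dst, const void *src, size_t n);  /* earlier sketch */

/* Hypothetical sketch of the memmove structure described above:
   everything except a large backward-overlapping copy falls straight
   into the memcpy code, because the small and medium paths already
   read all data before writing.  */
void *
sketch_memmove (void *dstv, const void *srcv, size_t n)
{
  uintptr_t d = (uintptr_t) dstv;
  uintptr_t s = (uintptr_t) srcv;

  if (n > 96)
    {
      if (d != s && d - s < n)
        {
          /* Destination starts inside the source: copy from the end
             downwards, 64 bytes per iteration (the assembly also
             aligns the end of the destination first, which this
             sketch omits).  Per-chunk overlap is left to the library
             memmove.  */
          char *dst = dstv;
          const char *src = srcv;
          while (n >= 64)
            {
              n -= 64;
              memmove (dst + n, src + n, 64);
            }
          memmove (dst, src, n);        /* remaining 0..63 byte head */
          return dstv;
        }
      if (s != d && s - d < n)
        /* Source starts inside the destination: copying forwards is
           safe, and the assembly does exactly that; since this
           sketch's large path uses library memcpy, defer to library
           memmove here instead.  */
        return memmove (dstv, srcv, n);
    }

  /* Everything else, including overlapping small and medium copies,
     shares the memcpy code.  */
  return sketch_memcpy (dstv, srcv, n);
}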

ChangeLog:
2015-07-08  Wilco Dijkstra  <[hidden email]>

        * sysdeps/aarch64/memcpy.S (memcpy):
        Rewrite of optimized memcpy and memmove.
        * sysdeps/aarch64/memmove.S (memmove): Remove
        memmove code (merged into memcpy.S).


Attachment: memcpy.patch (23K)

Re: [PATCH][AArch64] Optimized memcpy/memmove

Wilco Dijkstra-2
ping

________________________________________
From: Wilco Dijkstra
Sent: 12 May 2016 17:25
To: Marcus Shawcroft
Cc: GNU C Library; nd
Subject: Re: [PATCH][AArch64] Optimized memcpy/memmove

Marcus Shawcroft wrote:
> Hi, There appear to be odd tab characters inserted throughout the
> comments,  for example:

Hmm, it appears unexpand is buggy, so you end up having to do tabs
all by hand... Attached updated version.

Wilco

---
This is an optimized memcpy/memmove for AArch64. Copies are split into three main cases: small copies of up to 16 bytes; medium copies of 17..96 bytes, which are fully unrolled; and large copies of more than 96 bytes, which align the destination and use an unrolled loop processing 64 bytes per iteration. In order to share code with memmove, small and medium copies read all data before writing, allowing any kind of overlap. All memmoves except for the large backwards case fall into memcpy for optimal performance. On a random copy test, memcpy/memmove are 40.8% faster on A57 and 28.4% faster on A53.

ChangeLog:
2015-07-08  Wilco Dijkstra  <[hidden email]>

        * sysdeps/aarch64/memcpy.S (memcpy):
        Rewrite of optimized memcpy and memmove.
        * sysdeps/aarch64/memmove.S (memmove): Remove
        memmove code (merged into memcpy.S).


Attachment: memcpy.patch (23K)

Fw: [PATCH][AArch64] Optimized memcpy/memmove

Wilco Dijkstra-2


ping

________________________________________
From: Wilco Dijkstra
Sent: 12 May 2016 17:25
To: Marcus Shawcroft
Cc: GNU C Library; nd
Subject: Re: [PATCH][AArch64] Optimized memcpy/memmove

Marcus Shawcroft wrote:
> Hi, There appear to be odd tab characters inserted throughout the
> comments,  for example:

Hmm, it appears unexpand is buggy, so you end up having to do tabs
all by hand... Attached updated version.

Wilco

---
This is an optimized memcpy/memmove for AArch64. Copies are split into three main cases: small copies of up to 16 bytes; medium copies of 17..96 bytes, which are fully unrolled; and large copies of more than 96 bytes, which align the destination and use an unrolled loop processing 64 bytes per iteration. In order to share code with memmove, small and medium copies read all data before writing, allowing any kind of overlap. All memmoves except for the large backwards case fall into memcpy for optimal performance. On a random copy test, memcpy/memmove are 40.8% faster on A57 and 28.4% faster on A53.

ChangeLog:
2015-07-08  Wilco Dijkstra  <[hidden email]>

        * sysdeps/aarch64/memcpy.S (memcpy):
        Rewrite of optimized memcpy and memmove.
        * sysdeps/aarch64/memmove.S (memmove): Remove
        memmove code (merged into memcpy.S).


Attachment: memcpy.patch (23K)

Re: [PATCH][AArch64] Optimized memcpy/memmove

Marcus
In reply to this post by Wilco Dijkstra-2
On 12 May 2016 at 17:25, Wilco Dijkstra <[hidden email]> wrote:

>
> Marcus Shawcroft wrote:
>> Hi, There appear to be odd tab characters inserted throughout the
>> comments,  for example:
>
> Hmm, it appears unexpand is buggy, so you end up having to do tabs
> all by hand... Attached updated version.
>
> Wilco
>
> ---
> This is an optimized memcpy/memmove for AArch64. Copies are split into 3 main cases: small copies of up to 16 bytes, medium copies of 17..96 bytes which are fully unrolled. Large copies of more than 96 bytes align the destination and use an unrolled loop processing 64 bytes per iteration. In order to share code with memmove, small and medium copies read all data before writing, allowing any kind of overlap. All memmoves except for the large backwards case fall into memcpy for optimal performance. On a random copy test memcpy/memmove are 40.8% faster on A57 and 28.4% on A53.
>
> ChangeLog:
> 2015-07-08  Wilco Dijkstra  <[hidden email]>
>
>         * sysdeps/aarch64/memcpy.S (memcpy):
>         Rewrite of optimized memcpy and memmove.
>         * sysdeps/aarch64/memmove.S (memmove): Remove
>         memmove code (merged into memcpy.S).
>

Thanks Wilco. OK. /Marcus