[PATCH v4] faster strlen on x64


[PATCH v4] faster strlen on x64

Ondřej Bílka
Hello,

In the previous version I wrote that an unaligned read of the first 16 bytes
is a bad tradeoff. When I wrote a faster strcpy header I realized that this
was because I was doing a separate check for whether the read crosses a page.

When I only check that the next 64 bytes do not cross a page and do the
unaligned 16-byte load first, it causes only a small overhead for larger
strings. This makes my implementation faster for a wider family of
workloads: it speeds up the gcc benchmark and most other programs.

On the unit tests the revised version is somewhat slower than the previous
version. This is caused by the first-16-bytes path being taken only rarely,
which causes branch mispredictions.

I made two additional small improvements. The first is squashing the padding
patch. The second is that the page-crossing test can be done as
x % 4096 < 4096 - 48 instead of x % 4096 <= 4096 - 64, because I align x to
16 bytes.
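
To illustrate the arithmetic in C (a sketch only, not the committed assembly;
the function name is made up):

#include <stdint.h>

/* Sketch: the unaligned 16-byte header load is done only when the 64 bytes
   starting at the 16-byte-aligned base of s stay inside one 4096-byte page.
   Offsets up to 4047 round down to at most 4032, and 4032 + 64 == 4096, so
   checking x % 4096 < 4096 - 48 is enough.  */
static int
header_load_is_safe (const char *s)
{
  return ((uintptr_t) s & 4095) < 4096 - 48;
}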

I updated the benchmarks; the difference between the new and revised versions is at
http://kam.mff.cuni.cz/~ondra/benchmark_string/strlen_profile.html
http://kam.mff.cuni.cz/~ondra/strlen_profile.tar.bz2
 

Ondra

2013-01-31  Ondrej Bilka  <[hidden email]>

  * sysdeps/x86_64/strlen.S: Replace with new SSE2 based
  implementation which is faster on all x86_64 architectures.
  Tested on AMD, Intel Nehalem, SNB, IVB.
  * sysdeps/x86_64/strnlen.S: Likewise.

  * sysdeps/x86_64/multiarch/Makefile (sysdep_routines):
  Remove all multiarch strlen and strnlen versions.
  * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Update.
  Remove strlen and strnlen related parts.

  * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Update.
  Inline strlen part.
  * sysdeps/x86_64/multiarch/strcat-ssse3.S: Likewise.

  * sysdeps/x86_64/multiarch/strlen.S: Remove.
  * sysdeps/x86_64/multiarch/strlen-sse2-no-bsf.S: Remove.
  * sysdeps/x86_64/multiarch/strlen-sse2-pminub.S: Remove.
  * sysdeps/x86_64/multiarch/rtld-strlen.S: Remove.
  * sysdeps/x86_64/multiarch/strlen-sse4.S: Remove.
  * sysdeps/x86_64/multiarch/strnlen.S: Remove.
  * sysdeps/x86_64/multiarch/strnlen-sse2-no-bsf.S: Remove.

Attachment: 0001-Faster-strlen-on-x86-64.patch (46K)

Ping: [PATCH v4] faster strlen on x64

Ondřej Bílka
Ping,


On Wed, Feb 13, 2013 at 12:38:40PM +0100, Ondřej Bílka wrote:

> [...]


Re: Ping: [PATCH v4] faster strlen on x64

Liubov Dmitrieva
Hi,

I remember you tested at least the previous version on Atom but just
didn't include the results here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/strlen_profile.html
Right?

--
Liubov Dmitrieva
Intel Corporation

2013/2/25 Ondřej Bílka <[hidden email]>:

> Ping,
>
>
> On Wed, Feb 13, 2013 at 12:38:40PM +0100, Ondřej Bílka wrote:
>> [...]

Re: Ping: [PATCH v4] faster strlen on x64

Liubov Dmitrieva
In reply to this post by Ondřej Bílka
I've tried to run that test suite on a Haswell machine we have, to compare
the _revised and _new versions, but got a segmentation fault.
I downloaded the archive, extracted everything, and ran the "make" and
"./benchmarks" commands one by one in the test directory.
When the ./benchmarks script called the ./report binary, the program crashed.

The stack is:
Program received signal SIGSEGV, Segmentation fault.
0x000000301524d4d8 in __printf_fp () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.15-57.fc17.x86_64
(gdb) bt
#0  0x000000301524d4d8 in __printf_fp () from /lib64/libc.so.6
#1  0x000000301524a748 in vfprintf () from /lib64/libc.so.6
#2  0x000000301526e124 in vsprintf () from /lib64/libc.so.6
#3  0x0000003015250987 in sprintf () from /lib64/libc.so.6
#4  0x0000000000402984 in report_fn (smp=0x7ffff7fee000,
fname=0x403d47 "function", flags=0, binaries=0x7ffff7ffbd20) at
report.c:91
#5  0x0000000000403603 in main () at functions.h:1


--
Liubov Dmitrieva

2013/2/25 Ondřej Bílka <[hidden email]>:

> Ping,
>
>
> On Wed, Feb 13, 2013 at 12:38:40PM +0100, Ondřej Bílka wrote:
>> [...]

Re: Ping: [PATCH v4] faster strlen on x64

Ondřej Bílka
In reply to this post by Liubov Dmitrieva
On Thu, Feb 28, 2013 at 04:14:47PM +0400, Dmitrieva Liubov wrote:
> Hi,
>
> I remember you checked at least previous version on Atom but just
> don't put it here
> http://kam.mff.cuni.cz/~ondra/benchmark_string/strlen_profile.html
> Right?
Yes, the Atom is a notebook which I need to borrow, and I did not have it
at that time.

>
> --
> Liubov Dmitrieva
> Intel Corporation
>
> 2013/2/25 Ondřej Bílka <[hidden email]>:
> > Ping,
> >
> >
> > On Wed, Feb 13, 2013 at 12:38:40PM +0100, Ondřej Bílka wrote:
> >> [...]


Re: Ping: [PATCH v4] faster strlen on x64

Ondřej Bílka
In reply to this post by Liubov Dmitrieva

I uploaded the results to my page.
If nobody objects, I will commit it tomorrow.

On Thu, Feb 28, 2013 at 06:10:57PM +0400, Dmitrieva Liubov wrote:

> Looks like for Haswell the _new version is better than _revised for big
> lengths (according to results_rand/result.html).
> But you can submit either of them, as both are better than the current
> one. We don't see any reason to reject the patch on our side.
>
> --
> Liubov Dmitrieva
>
> 2013/2/28 Dmitrieva Liubov <[hidden email]>:
> > Yes. This helps.
> >
> > Keep Haswell results.
> >
> > 2013/2/28 Ondřej Bílka <[hidden email]>:
> >> On Thu, Feb 28, 2013 at 04:29:38PM +0400, Dmitrieva Liubov wrote:
> >>> I've tried to run that test suite on Haswell machine we have to
> >>> compare _revised version and _new version but got Segmentation fault.
> >>> I downloaded the archive, extracted all, and run at the test directory
> >>> "make" and "./benchmarks" commands one by one.
> >>> When ./benchmarks script called ./report binary the program broke.
> >>>
> >>> [...]
> >>>
> >> It is weird that it segfaults there. The only problem I see is that I disabled
> >> AVX2 compilation because it did not work with older binutils.
> >>
> >> I did disable it completely; the detection in the file test_sse needs to be modified as follows.
> >>
> >> if [ -z "$AVX2" ]
> >> then
> >> echo 3
> >> else
> >> echo 4 # replace with 3
> >> fi
> >>
> >> Tell me if this helps.
> >>>
> >>> --
> >>> Liubov Dmitrieva
> >>>
> >>> 2013/2/25 Ondřej Bílka <[hidden email]>:
> >>> > Ping,
> >>> >
> >>> >
> >>> > On Wed, Feb 13, 2013 at 12:38:40PM +0100, Ondřej Bílka wrote:
> >>> >> [...]
> >>
> >> --
> >>
> >> Your mail is being routed through Germany ... and they're censoring us.


Re: Ping: [PATCH v4] faster strlen on x64

Ondřej Bílka
On Tue, Mar 05, 2013 at 06:39:15PM +0100, Ondřej Bílka wrote:
>
> I uploaded results to my page.
> If nobody has objection I will commit it tomorrow.
Committed.
>
Ondra

Re: Ping: [PATCH v4] faster strlen on x64

Andreas Jaeger-8
On 03/06/2013 10:00 PM, Ondřej Bílka wrote:
> On Tue, Mar 05, 2013 at 06:39:15PM +0100, Ondřej Bílka wrote:
>>
>> I uploaded results to my page.
>> If nobody has objection I will commit it tomorrow.
> Commited.

Ondrej, I feel ashamed that nobody had time to review your patches in time.

Still, our rules ask for review by somebody else and thus I'm surprised
by your commit. Silent approval ("if nobody objects") without any kind
of review and approval is not the way to go.

Please revert the patch and let's give it a proper review with explicit
approvals before you commit!

I'll do a first review now.

Next time if nobody reviews your submission, please ask gently again and
again...

Andreas
--
  Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
   SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
    GF: Jeff Hawn,Jennifer Guild,Felix Imendörffer,HRB16746 (AG Nürnberg)
     GPG fingerprint = 93A3 365E CE47 B889 DF7F  FED1 389A 563C C272 A126

Re: [PATCH v4] faster strlen on x64

Andreas Jaeger-8
In reply to this post by Ondřej Bílka
On 02/13/2013 12:38 PM, Ondřej Bílka wrote:

> [...]
>
> 2013-01-31  Ondrej Bilka  <[hidden email]>
>
>    * sysdeps/x86_64/strlen.S: Replace with new SSE2 based
>    implementation which is faster on all x86_64 architectures.
>    Tested on AMD, Intel Nehalem, SNB, IVB.
>    * sysdeps/x86_64/strnlen.S: Likewise.
>
>    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines):
>    Remove all multiarch strlen and strnlen versions.
>    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Update.
>    Remove strlen and strnlen related parts.
>
>    * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Update.
>    Inline strlen part.
>    * sysdeps/x86_64/multiarch/strcat-ssse3.S: Likewise.
>
>    * sysdeps/x86_64/multiarch/strlen.S: Remove.
>    * sysdeps/x86_64/multiarch/strlen-sse2-no-bsf.S: Remove.
>    * sysdeps/x86_64/multiarch/strlen-sse2-pminub.S: Remove.
>    * sysdeps/x86_64/multiarch/rtld-strlen.S: Remove.
>    * sysdeps/x86_64/multiarch/strlen-sse4.S: Remove.
>    * sysdeps/x86_64/multiarch/strnlen.S: Remove.
>    * sysdeps/x86_64/multiarch/strnlen-sse2-no-bsf.S: Remove.
>
> From e7b469b93cf4cfd6475b644b0f2d72b8ae47170f Mon Sep 17 00:00:00 2001
> From: Ondrej Bilka <[hidden email]>
> Date: Wed, 30 Jan 2013 18:13:22 +0100
> Subject: [PATCH] Faster strlen on x86-64.
>
> ---
>  sysdeps/x86_64/multiarch/Makefile                |    6 +-
>  sysdeps/x86_64/multiarch/ifunc-impl-list.c       |   13 -
>  sysdeps/x86_64/multiarch/rtld-strlen.S           |    1 -
>  sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S |  229 +++++++-
>  sysdeps/x86_64/multiarch/strcat-ssse3.S          |  312 ++++++++++-
>  sysdeps/x86_64/multiarch/strlen-sse2-no-bsf.S    |  685 ----------------------
>  sysdeps/x86_64/multiarch/strlen-sse2-pminub.S    |  259 --------
>  sysdeps/x86_64/multiarch/strlen-sse4.S           |   84 ---
>  sysdeps/x86_64/multiarch/strlen.S                |   68 ---
>  sysdeps/x86_64/multiarch/strnlen-sse2-no-bsf.S   |    3 -
>  sysdeps/x86_64/multiarch/strnlen.S               |   57 --
>  sysdeps/x86_64/strlen.S                          |  265 +++++++--
>  sysdeps/x86_64/strnlen.S                         |   66 +--
>  13 files changed, 742 insertions(+), 1306 deletions(-)
>  delete mode 100644 sysdeps/x86_64/multiarch/rtld-strlen.S
>  delete mode 100644 sysdeps/x86_64/multiarch/strlen-sse2-no-bsf.S
>  delete mode 100644 sysdeps/x86_64/multiarch/strlen-sse2-pminub.S
>  delete mode 100644 sysdeps/x86_64/multiarch/strlen-sse4.S
>  delete mode 100644 sysdeps/x86_64/multiarch/strlen.S
>  delete mode 100644 sysdeps/x86_64/multiarch/strnlen-sse2-no-bsf.S
>  delete mode 100644 sysdeps/x86_64/multiarch/strnlen.S
>
> diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> index dd6c27d..67686ad 100644
> --- a/sysdeps/x86_64/multiarch/Makefile
> +++ b/sysdeps/x86_64/multiarch/Makefile
> @@ -10,14 +10,12 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c strcmp-ssse3 strncmp-ssse3 \
>     strend-sse4 memcmp-sse4 memcpy-ssse3 mempcpy-ssse3 \
>     memmove-ssse3 memcpy-ssse3-back mempcpy-ssse3-back \
>     memmove-ssse3-back strcasestr-nonascii strcasecmp_l-ssse3 \
> -   strncase_l-ssse3 strlen-sse4 strlen-sse2-no-bsf memset-x86-64 \
> +   strncase_l-ssse3 memset-x86-64 strcat-ssse3 strncat-ssse3\
>     strcpy-ssse3 strncpy-ssse3 stpcpy-ssse3 stpncpy-ssse3 \
>     strcpy-sse2-unaligned strncpy-sse2-unaligned \
>     stpcpy-sse2-unaligned stpncpy-sse2-unaligned \
>     strcat-sse2-unaligned strncat-sse2-unaligned \
> -   strcat-ssse3 strncat-ssse3 strlen-sse2-pminub \
> -   strnlen-sse2-no-bsf strrchr-sse2-no-bsf strchr-sse2-no-bsf \
> -   memcmp-ssse3
> +   strrchr-sse2-no-bsf strchr-sse2-no-bsf memcmp-ssse3
>  ifeq (yes,$(config-cflags-sse4))
>  sysdep_routines += strcspn-c strpbrk-c strspn-c strstr-c strcasestr-c varshift
>  CFLAGS-varshift.c += -msse4
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index 643cb2d..848991e 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -187,11 +187,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>        __strncpy_sse2_unaligned)
>        IFUNC_IMPL_ADD (array, i, strncpy, 1, __strncpy_sse2))
>
> -  /* Support sysdeps/x86_64/multiarch/strnlen.S.  */
> -  IFUNC_IMPL (i, name, strnlen,
> -      IFUNC_IMPL_ADD (array, i, strnlen, 1, __strnlen_sse2_no_bsf)
> -      IFUNC_IMPL_ADD (array, i, strnlen, 1, __strnlen_sse2))
> -
>    /* Support sysdeps/x86_64/multiarch/strpbrk.S.  */
>    IFUNC_IMPL (i, name, strpbrk,
>        IFUNC_IMPL_ADD (array, i, strpbrk, HAS_SSE4_2,
> @@ -262,14 +257,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>        __mempcpy_ssse3)
>        IFUNC_IMPL_ADD (array, i, mempcpy, 1, __mempcpy_sse2))
>
> -  /* Support sysdeps/x86_64/multiarch/strlen.S.  */
> -  IFUNC_IMPL (i, name, strlen,
> -      IFUNC_IMPL_ADD (array, i, strlen, HAS_SSE4_2, __strlen_sse42)
> -      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2_pminub)
> -      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2_no_bsf)
> -      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2)
> -      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2))
> -
>    /* Support sysdeps/x86_64/multiarch/strncmp.S.  */
>    IFUNC_IMPL (i, name, strncmp,
>        IFUNC_IMPL_ADD (array, i, strncmp, HAS_SSE4_2,
> diff --git a/sysdeps/x86_64/multiarch/rtld-strlen.S b/sysdeps/x86_64/multiarch/rtld-strlen.S
> deleted file mode 100644
> index 596e054..0000000
> --- a/sysdeps/x86_64/multiarch/rtld-strlen.S
> +++ /dev/null
> @@ -1 +0,0 @@
> -#include "../rtld-strlen.S"
> diff --git a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> index 72bb609..6d9951e 100644
> --- a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> @@ -34,10 +34,233 @@ ENTRY (STRCAT)
>   mov %rdx, %r8
>  # endif
>
> -# define RETURN  jmp L(StartStrcpyPart)
> -# include "strlen-sse2-pminub.S"
> -# undef RETURN

You say in the ChangeLog entry that you inline the strlen part, but I do
not see this in the function.

Please add comments here, it's not clear what this code does at all.

Explain at the beginning the algorithm used, and explain what's
happening in the code.

> + xor %rax, %rax
> + mov %edi, %ecx
> + and $0x3f, %ecx
> + pxor %xmm0, %xmm0
> + cmp $0x30, %ecx
> + ja L(next)
> + movdqu (%rdi), %xmm1
> + pcmpeqb %xmm1, %xmm0
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + jnz L(exit_less16)
> + mov %rdi, %rax
> + and $-16, %rax
> + jmp L(align16_start)
> +L(next):
> + mov %rdi, %rax
> + and $-16, %rax
> + pcmpeqb (%rax), %xmm0
> + mov $-1, %r10d
> + sub %rax, %rcx
> + shl %cl, %r10d
> + pmovmskb %xmm0, %edx
> + and %r10d, %edx
> + jnz L(exit)
>
> +L(align16_start):
> + pxor %xmm0, %xmm0
> + pxor %xmm1, %xmm1
> + pxor %xmm2, %xmm2
> + pxor %xmm3, %xmm3
> + pcmpeqb 16(%rax), %xmm0
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + jnz L(exit16)
> +
> + pcmpeqb 32(%rax), %xmm1
> + pmovmskb %xmm1, %edx
> + test %edx, %edx
> + jnz L(exit32)
> +
> + pcmpeqb 48(%rax), %xmm2
> + pmovmskb %xmm2, %edx
> + test %edx, %edx
> + jnz L(exit48)
> +
> + pcmpeqb 64(%rax), %xmm3
> + pmovmskb %xmm3, %edx
> + test %edx, %edx
> + jnz L(exit64)
> +
> + pcmpeqb 80(%rax), %xmm0
> + add $64, %rax
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + jnz L(exit16)
> +
> + pcmpeqb 32(%rax), %xmm1
> + pmovmskb %xmm1, %edx
> + test %edx, %edx
> + jnz L(exit32)
> +
> + pcmpeqb 48(%rax), %xmm2
> + pmovmskb %xmm2, %edx
> + test %edx, %edx
> + jnz L(exit48)
> +
> + pcmpeqb 64(%rax), %xmm3
> + pmovmskb %xmm3, %edx
> + test %edx, %edx
> + jnz L(exit64)
> +
> + pcmpeqb 80(%rax), %xmm0
> + add $64, %rax
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + jnz L(exit16)
> +
> + pcmpeqb 32(%rax), %xmm1
> + pmovmskb %xmm1, %edx
> + test %edx, %edx
> + jnz L(exit32)
> +
> + pcmpeqb 48(%rax), %xmm2
> + pmovmskb %xmm2, %edx
> + test %edx, %edx
> + jnz L(exit48)
> +
> + pcmpeqb 64(%rax), %xmm3
> + pmovmskb %xmm3, %edx
> + test %edx, %edx
> + jnz L(exit64)
> +
> + pcmpeqb 80(%rax), %xmm0
> + add $64, %rax
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + jnz L(exit16)
> +
> + pcmpeqb 32(%rax), %xmm1
> + pmovmskb %xmm1, %edx
> + test %edx, %edx
> + jnz L(exit32)
> +
> + pcmpeqb 48(%rax), %xmm2
> + pmovmskb %xmm2, %edx
> + test %edx, %edx
> + jnz L(exit48)
> +
> + pcmpeqb 64(%rax), %xmm3
> + pmovmskb %xmm3, %edx
> + test %edx, %edx
> + jnz L(exit64)
> +
> + test $0x3f, %rax
> + jz L(align64_loop)
> +
> + pcmpeqb 80(%rax), %xmm0
> + add $80, %rax
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + jnz L(exit)
> +
> + test $0x3f, %rax
> + jz L(align64_loop)
> +
> + pcmpeqb 16(%rax), %xmm1
> + add $16, %rax
> + pmovmskb %xmm1, %edx
> + test %edx, %edx
> + jnz L(exit)
> +
> + test $0x3f, %rax
> + jz L(align64_loop)
> +
> + pcmpeqb 16(%rax), %xmm2
> + add $16, %rax
> + pmovmskb %xmm2, %edx
> + test %edx, %edx
> + jnz L(exit)
> +
> + test $0x3f, %rax
> + jz L(align64_loop)
> +
> + pcmpeqb 16(%rax), %xmm3
> + add $16, %rax
> + pmovmskb %xmm3, %edx
> + test %edx, %edx
> + jnz L(exit)
> +
> + add $16, %rax
> + .p2align 4
> + L(align64_loop):
> + movaps (%rax), %xmm4
> + pminub 16(%rax), %xmm4
> + movaps 32(%rax), %xmm5
> + pminub 48(%rax), %xmm5
> + add $64, %rax
> + pminub %xmm4, %xmm5
> + pcmpeqb %xmm0, %xmm5
> + pmovmskb %xmm5, %edx
> + test %edx, %edx
> + jz L(align64_loop)
> +
> + pcmpeqb -64(%rax), %xmm0
> + sub $80, %rax
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + jnz L(exit16)
> +
> + pcmpeqb 32(%rax), %xmm1
> + pmovmskb %xmm1, %edx
> + test %edx, %edx
> + jnz L(exit32)
> +
> + pcmpeqb 48(%rax), %xmm2
> + pmovmskb %xmm2, %edx
> + test %edx, %edx
> + jnz L(exit48)
> +
> + pcmpeqb 64(%rax), %xmm3
> + pmovmskb %xmm3, %edx
> + sub %rdi, %rax
> + bsf %rdx, %rdx
> + add %rdx, %rax
> + add $64, %rax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit):
> + sub %rdi, %rax
> +L(exit_less16):
> + bsf %rdx, %rdx
> + add %rdx, %rax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit16):
> + sub %rdi, %rax
> + bsf %rdx, %rdx
> + add %rdx, %rax
> + add $16, %rax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit32):
> + sub %rdi, %rax
> + bsf %rdx, %rdx
> + add %rdx, %rax
> + add $32, %rax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit48):
> + sub %rdi, %rax
> + bsf %rdx, %rdx
> + add %rdx, %rax
> + add $48, %rax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit64):
> + sub %rdi, %rax
> + bsf %rdx, %rdx
> + add %rdx, %rax
> + add $64, %rax
> +
> + .p2align 4
>  L(StartStrcpyPart):
>   lea (%r9, %rax), %rdi
>   mov %rsi, %rcx
> diff --git a/sysdeps/x86_64/multiarch/strcat-ssse3.S b/sysdeps/x86_64/multiarch/strcat-ssse3.S
> index fea9d11..901e66f 100644
> --- a/sysdeps/x86_64/multiarch/strcat-ssse3.S
> +++ b/sysdeps/x86_64/multiarch/strcat-ssse3.S
> @@ -33,11 +33,317 @@ ENTRY (STRCAT)
>   mov %rdx, %r8
>  # endif
>
> -# define RETURN  jmp L(StartStrcpyPart)
> -# include "strlen-sse2-no-bsf.S"

Same here, this code needs comments

> + xor %eax, %eax
> + cmpb $0, (%rdi)
> + jz L(exit_tail0)
> + cmpb $0, 1(%rdi)
> + jz L(exit_tail1)
> + cmpb $0, 2(%rdi)
> + jz L(exit_tail2)
> + cmpb $0, 3(%rdi)
> + jz L(exit_tail3)
> +
> + cmpb $0, 4(%rdi)
> + jz L(exit_tail4)
> + cmpb $0, 5(%rdi)
> + jz L(exit_tail5)
> + cmpb $0, 6(%rdi)
> + jz L(exit_tail6)
> + cmpb $0, 7(%rdi)
> + jz L(exit_tail7)
> +
> + cmpb $0, 8(%rdi)
> + jz L(exit_tail8)
> + cmpb $0, 9(%rdi)
> + jz L(exit_tail9)
> + cmpb $0, 10(%rdi)
> + jz L(exit_tail10)
> + cmpb $0, 11(%rdi)
> + jz L(exit_tail11)
> +
> + cmpb $0, 12(%rdi)
> + jz L(exit_tail12)
> + cmpb $0, 13(%rdi)
> + jz L(exit_tail13)
> + cmpb $0, 14(%rdi)
> + jz L(exit_tail14)
> + cmpb $0, 15(%rdi)
> + jz L(exit_tail15)
> + pxor %xmm0, %xmm0
> + lea 16(%rdi), %rcx
> + lea 16(%rdi), %rax
> + and $-16, %rax
> +
> + pcmpeqb (%rax), %xmm0
> + pmovmskb %xmm0, %edx
> + pxor %xmm1, %xmm1
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm1
> + pmovmskb %xmm1, %edx
> + pxor %xmm2, %xmm2
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm2
> + pmovmskb %xmm2, %edx
> + pxor %xmm3, %xmm3
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm3
> + pmovmskb %xmm3, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm0
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm1
> + pmovmskb %xmm1, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm2
> + pmovmskb %xmm2, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm3
> + pmovmskb %xmm3, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm0
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm1
> + pmovmskb %xmm1, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm2
> + pmovmskb %xmm2, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm3
> + pmovmskb %xmm3, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm0
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm1
> + pmovmskb %xmm1, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm2
> + pmovmskb %xmm2, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + pcmpeqb (%rax), %xmm3
> + pmovmskb %xmm3, %edx
> + test %edx, %edx
> + lea 16(%rax), %rax
> + jnz L(exit)
> +
> + and $-0x40, %rax
>
> -# undef RETURN
> + .p2align 4
> +L(aligned_64):
> + pcmpeqb (%rax), %xmm0
> + pcmpeqb 16(%rax), %xmm1
> + pcmpeqb 32(%rax), %xmm2
> + pcmpeqb 48(%rax), %xmm3
> + pmovmskb %xmm0, %edx
> + pmovmskb %xmm1, %r11d
> + pmovmskb %xmm2, %r10d
> + pmovmskb %xmm3, %r9d
> + or %edx, %r9d
> + or %r11d, %r9d
> + or %r10d, %r9d
> + lea 64(%rax), %rax
> + jz L(aligned_64)
> +
> + test %edx, %edx
> + jnz L(aligned_64_exit_16)
> + test %r11d, %r11d
> + jnz L(aligned_64_exit_32)
> + test %r10d, %r10d
> + jnz L(aligned_64_exit_48)
> +
> +L(aligned_64_exit_64):
> + pmovmskb %xmm3, %edx
> + jmp L(exit)
> +
> +L(aligned_64_exit_48):
> + lea -16(%rax), %rax
> + mov %r10d, %edx
> + jmp L(exit)
> +
> +L(aligned_64_exit_32):
> + lea -32(%rax), %rax
> + mov %r11d, %edx
> + jmp L(exit)
> +
> +L(aligned_64_exit_16):
> + lea -48(%rax), %rax
> +
> +L(exit):
> + sub %rcx, %rax
> + test %dl, %dl
> + jz L(exit_high)
> + test $0x01, %dl
> + jnz L(exit_tail0)
> +
> + test $0x02, %dl
> + jnz L(exit_tail1)
> +
> + test $0x04, %dl
> + jnz L(exit_tail2)
> +
> + test $0x08, %dl
> + jnz L(exit_tail3)
> +
> + test $0x10, %dl
> + jnz L(exit_tail4)
> +
> + test $0x20, %dl
> + jnz L(exit_tail5)
> +
> + test $0x40, %dl
> + jnz L(exit_tail6)
> + add $7, %eax
> +L(exit_tail0):
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_high):
> + add $8, %eax
> + test $0x01, %dh
> + jnz L(exit_tail0)
> +
> + test $0x02, %dh
> + jnz L(exit_tail1)
> +
> + test $0x04, %dh
> + jnz L(exit_tail2)
> +
> + test $0x08, %dh
> + jnz L(exit_tail3)
> +
> + test $0x10, %dh
> + jnz L(exit_tail4)
> +
> + test $0x20, %dh
> + jnz L(exit_tail5)
> +
> + test $0x40, %dh
> + jnz L(exit_tail6)
> + add $7, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail1):
> + add $1, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail2):
> + add $2, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail3):
> + add $3, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail4):
> + add $4, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail5):
> + add $5, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail6):
> + add $6, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail7):
> + add $7, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail8):
> + add $8, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail9):
> + add $9, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail10):
> + add $10, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail11):
> + add $11, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail12):
> + add $12, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail13):
> + add $13, %eax
> + jmp L(StartStrcpyPart)
>
> + .p2align 4
> +L(exit_tail14):
> + add $14, %eax
> + jmp L(StartStrcpyPart)
> +
> + .p2align 4
> +L(exit_tail15):
> + add $15, %eax
> +
> + .p2align 4
>  L(StartStrcpyPart):
>   mov %rsi, %rcx
>   lea (%rdi, %rax), %rdx

> diff --git a/sysdeps/x86_64/strlen.S b/sysdeps/x86_64/strlen.S
> index 4bdca0a..c8ced10 100644
> --- a/sysdeps/x86_64/strlen.S
> +++ b/sysdeps/x86_64/strlen.S
> @@ -1,6 +1,5 @@
> -/* strlen(str) -- determine the length of the string STR.
> -   Copyright (C) 2009-2013 Free Software Foundation, Inc.
> -   Contributed by Ulrich Drepper <[hidden email]>.
> +/* SSE2 version of strlen.
> +   Copyright (C) 2012, 2013 Free Software Foundation, Inc.

It's 2012-2013

>     This file is part of the GNU C Library.
>
>     The GNU C Library is free software; you can redistribute it and/or
> @@ -19,83 +18,219 @@
>
>  #include <sysdep.h>
>
> +/* Used in linker - use only %xmm8-%xmm15.  */

Better expand it. Say explicitly - if it's true: This is used inside the
dynamic linker during relocation, thus we cannot touch the usual registers.
We can use %xmm8 to %xmm15.

> - .text
> +/* Long lived register are
> + strlen(s), strnlen(s, n):

No line break here.
Rewrite this a bit. What about?

Long lived register usage for strlen(s), strnlen(s, n):

Still not perfect but I don't have a better idea.

> +
> + %xmm11 - zero
> + %rdi   - s
> + %r10  (s+n) & (~(64-1))
Don't add a tab here; make it (s+n) & (~(64-1))
> + %r11   s+n


> +*/
> +
> +
> +.text
>  ENTRY(strlen)
> +

this needs a comment. What is this macro doing?

> +#define FIND_ZERO \
> + pcmpeqb (%rax), %xmm8; \
> + pcmpeqb 16(%rax), %xmm9; \
> + pcmpeqb 32(%rax), %xmm10; \
> + pcmpeqb 48(%rax), %xmm11; \
> + pmovmskb %xmm8, %esi; \
> + pmovmskb %xmm9, %edx; \
> + pmovmskb %xmm10, %r8d; \
> + pmovmskb %xmm11, %ecx; \
> + salq $16, %rdx; \
> + salq $16, %rcx; \
> + orq %rsi, %rdx; \
> + orq %r8, %rcx; \
> + salq $32, %rcx; \
> + orq %rcx, %rdx;
> +
> +#ifdef AS_STRNLEN
> +/* Do not read anything when n==0.  */
> + test %rsi, %rsi
> + jne L(n_nonzero)
>   xor %rax, %rax
> - mov %edi, %ecx
> - and $0x3f, %ecx
> - pxor %xmm0, %xmm0
> - cmp $0x30, %ecx
> + ret
> +L(n_nonzero):
> +
> +/* Initialize long lived registers.  */
> +
> + add %rdi, %rsi
> + mov %rsi, %r10
> + and $-64, %r10
> + mov %rsi, %r11
> +#endif
> +
> + pxor %xmm8, %xmm8
> + pxor %xmm9, %xmm9
> + pxor %xmm10, %xmm10
> + pxor %xmm11, %xmm11
> + movq %rdi, %rax
> + movq %rdi, %rcx
> + andq $4095, %rcx
> +/* Offsets 4032-4047 will be aligned into 4032 thus fit into page.  */
> + cmpq $4047, %rcx
> +/* We cannot unify this branching as it would be ~6 cycles slower.  */
>   ja L(next)
> - movdqu (%rdi), %xmm1
> - pcmpeqb %xmm1, %xmm0
> - pmovmskb %xmm0, %edx
> +
> +#ifdef AS_STRNLEN

What is this prolog doing?

> +# define STRNLEN_PROLOG \
> + mov %r11, %rsi; \
> + subq %rax, %rsi; \
> + andq $-64, %rax; \
> + testq $-64, %rsi; \
> + je L(strnlen_ret)
> +#else
> +# define STRNLEN_PROLOG  andq $-64, %rax;
> +#endif
> +

And this one? Please document!

> +#define PROLOG(lab) \
> + movq %rdi, %rcx; \
> + xorq %rax, %rcx; \
> + STRNLEN_PROLOG; \
> + sarq %cl, %rdx; \
> + test %rdx, %rdx; \
> + je L(lab); \
> + bsfq %rdx, %rax; \
> + ret
> +
> +#ifdef AS_STRNLEN
> + andq $-16, %rax
> + FIND_ZERO
> +#else
> + movdqu (%rax), %xmm12
> + pcmpeqb %xmm8, %xmm12
> + pmovmskb %xmm12, %edx
>   test %edx, %edx
> - jnz L(exit_less16)
> - mov %rdi, %rax
> - and $-16, %rax
> - jmp L(align16_start)
> + je L(next48_bytes)
> + bsfq %rdx, %rax
> + ret
> +
> +L(next48_bytes):
> +/* Same as FIND_ZERO except we do not check first 16 bytes.  */
> + andq $-16, %rax
> + pcmpeqb 16(%rax), %xmm9;
> + pcmpeqb 32(%rax), %xmm10;
> + pcmpeqb 48(%rax), %xmm11;
> + pmovmskb %xmm9, %edx;
> + pmovmskb %xmm10, %r8d;
> + pmovmskb %xmm11, %ecx;
> + salq $16, %rdx;
> + salq $16, %rcx;
> + orq %r8, %rcx;
> + salq $32, %rcx;
> + orq %rcx, %rdx;
> +#endif
> +
> + PROLOG(loop)
> +
> + .p2align 4
>  L(next):
> - mov %rdi, %rax
> - and $-16, %rax
> - pcmpeqb (%rax), %xmm0
> - mov $-1, %esi
> - sub %rax, %rcx
> - shl %cl, %esi
> - pmovmskb %xmm0, %edx
> - and %esi, %edx
> - jnz L(exit)
> -L(align16_start):
> - pxor %xmm0, %xmm0
> - pxor %xmm1, %xmm1
> - pxor %xmm2, %xmm2
> - pxor %xmm3, %xmm3
> + andq $-64, %rax
> + FIND_ZERO
> + PROLOG(loop_init)
> +
> +#ifdef AS_STRNLEN
> +/* We must do this check to correctly handle strnlen (s, -1).  */
> +L(strnlen_ret):
> + bts %rsi, %rdx
> + sarq %cl, %rdx
> + test %rdx, %rdx
> + je L(loop_init)
> + bsfq %rdx, %rax
> + ret
> +#endif
>   .p2align 4
> -L(align16_loop):
> - pcmpeqb 16(%rax), %xmm0
> - pmovmskb %xmm0, %edx
> - test %edx, %edx
> - jnz L(exit16)
> +L(loop_init):
> + pxor %xmm9, %xmm9
> + pxor %xmm10, %xmm10
> + pxor %xmm11, %xmm11
> +#ifdef AS_STRNLEN
> + .p2align 4
> +L(loop):
>
> - pcmpeqb 32(%rax), %xmm1
> - pmovmskb %xmm1, %edx
> - test %edx, %edx
> - jnz L(exit32)
> + addq $64, %rax
> + cmpq %rax, %r10
> + je L(exit_end)
>
> - pcmpeqb 48(%rax), %xmm2
> - pmovmskb %xmm2, %edx
> - test %edx, %edx
> - jnz L(exit48)
> + movdqa (%rax), %xmm8
> + pminub 16(%rax), %xmm8
> + pminub 32(%rax), %xmm8
> + pminub 48(%rax), %xmm8
> + pcmpeqb %xmm11, %xmm8
> + pmovmskb %xmm8, %edx
> + testl %edx, %edx
> + jne L(exit)
> + jmp L(loop)
>
> - pcmpeqb 64(%rax), %xmm3
> - pmovmskb %xmm3, %edx
> - lea 64(%rax), %rax
> - test %edx, %edx
> - jz L(align16_loop)
> -L(exit):
> - sub %rdi, %rax
> -L(exit_less16):
> - bsf %rdx, %rdx
> - add %rdx, %rax
> - ret
>   .p2align 4
> -L(exit16):
> - sub %rdi, %rax
> - bsf %rdx, %rdx
> - lea 16(%rdx,%rax), %rax
> +L(exit_end):
> + cmp %rax, %r11
> + je L(first)
> + pxor %xmm8, %xmm8
> + FIND_ZERO
> +
> +L(first):
> + bts %r11, %rdx
> + bsfq %rdx, %rdx
> + addq %rdx, %rax
> + subq %rdi, %rax
>   ret
> +
>   .p2align 4
> -L(exit32):
> - sub %rdi, %rax
> - bsf %rdx, %rdx
> - lea 32(%rdx,%rax), %rax
> +L(exit):
> + pxor %xmm8, %xmm8
> + FIND_ZERO
> +
> + bsfq %rdx, %rdx
> + addq %rdx, %rax
> + subq %rdi, %rax
>   ret
> +
> +#else
> + .p2align 4
> +L(loop):
> +
> + movdqa 64(%rax), %xmm8
> + pminub 80(%rax), %xmm8
> + pminub 96(%rax), %xmm8
> + pminub 112(%rax), %xmm8
> + pcmpeqb %xmm11, %xmm8
> + pmovmskb %xmm8, %edx
> + testl %edx, %edx
> + jne L(exit64)
> +
> + subq $-128, %rax
> +
> + movdqa (%rax), %xmm8
> + pminub 16(%rax), %xmm8
> + pminub 32(%rax), %xmm8
> + pminub 48(%rax), %xmm8
> + pcmpeqb %xmm11, %xmm8
> + pmovmskb %xmm8, %edx
> + testl %edx, %edx
> + jne L(exit0)
> + jmp L(loop)
> +
>   .p2align 4
> -L(exit48):
> - sub %rdi, %rax
> - bsf %rdx, %rdx
> - lea 48(%rdx,%rax), %rax
> +L(exit64):
> + addq $64, %rax
> +L(exit0):
> + pxor %xmm8, %xmm8
> + FIND_ZERO
> +
> + bsfq %rdx, %rdx
> + addq %rdx, %rax
> + subq %rdi, %rax
>   ret
> +
> +#endif
> +
>  END(strlen)
> +#ifndef AS_STRLEN
>  libc_hidden_builtin_def (strlen)
> +#endif
> diff --git a/sysdeps/x86_64/strnlen.S b/sysdeps/x86_64/strnlen.S
> index 6e53503..d0694a5 100644
> --- a/sysdeps/x86_64/strnlen.S
> +++ b/sysdeps/x86_64/strnlen.S
> @@ -1,63 +1,7 @@
> -/* strnlen(str,maxlen) -- determine the length of the string STR up to MAXLEN.
> -   Copyright (C) 2010-2013 Free Software Foundation, Inc.
> -   Contributed by Ulrich Drepper <[hidden email]>.
> -   This file is part of the GNU C Library.

Please add the usual copyright etc. headers - even for a trivial file.

> +#define AS_STRNLEN
> +#define strlen __strnlen
> +#include "strlen.S"
>
> -   The GNU C Library is free software; you can redistribute it and/or
> -   modify it under the terms of the GNU Lesser General Public
> -   License as published by the Free Software Foundation; either
> -   version 2.1 of the License, or (at your option) any later version.
> +weak_alias (__strnlen, strnlen);
> +libc_hidden_builtin_def (strnlen)
> [...]

Please add more comments, especially where I noted that and resend for
another review,

thanks,
Andreas
--
  Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
   SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
    GF: Jeff Hawn,Jennifer Guild,Felix Imendörffer,HRB16746 (AG Nürnberg)
     GPG fingerprint = 93A3 365E CE47 B889 DF7F  FED1 389A 563C C272 A126

Re: [PATCH v4] faster strlen on x64

Richard Henderson
On 2013-03-06 13:54, Andreas Jaeger wrote:
>> +/* Used in linker - use only %xmm8-%xmm15.  */
>
> Better expand
> Say explictely - if it's true: This is used inside the dynamic linker during
> relocation, thus we cannot touch the usual registers. We can use %xmm8 to %xmm15.

I did already object to using *any* xmm registers in the dynamic linker.

I know for a fact that Intel has a __regcall calling convention in place that
uses all 16 xmm registers for parameter passing.


r~

Re: Ping: [PATCH v4] faster strlen on x64

Ondřej Bílka
In reply to this post by Andreas Jaeger-8
On Wed, Mar 06, 2013 at 10:26:05PM +0100, Andreas Jaeger wrote:

> On 03/06/2013 10:00 PM, Ondřej Bílka wrote:
> >On Tue, Mar 05, 2013 at 06:39:15PM +0100, Ondřej Bílka wrote:
> >>
> >>I uploaded results to my page.
> >>If nobody has objection I will commit it tomorrow.
> >Commited.
>
> Ondrej, I feel ashamed that nobody had time to review your patches in time.
>
> Still, our rules ask for review by somebody else and thus I'm
> surprised by your commit. Silent approval ("if nobody objects")
> without any kind of review and approval is not the way to go.

It served the purpose of a deadline to get a response (a late one; I expected
someone would answer sooner).
>
> Please revert the patch and let's give it a proper review with
> explicit aprovals before you commit!
>
> I'll do a first review now.
>
> Next time if nobody reviews your submission, please ask gently again
> and again...
>
I estimate that if I did not hold back patches I would have around a hundred
in flight and would be pinging them each week.


Re: Ping: [PATCH v4] faster strlen on x64

Carlos O'Donell-2
On Wed, Mar 6, 2013 at 5:37 PM, Ondřej Bílka <[hidden email]> wrote:
> I estimate that if I will did not hold back patches I would have around
> hundred in flight and ping them each week.

I encourage you to do the following:

* Ping your patches. However, flooding the list with pings will not help.
Please be respectful. Choose one patch and see it through to commit.

* Work to gather consensus from the interested parties. You should get
consensus from Dmitrieva Liubov, and the new machine maintainers
for x86; which are now Andreas, and myself.

* Assist in reviewing other patches, such that other reviewers are
freed up to review your patches. This will also help other developers
gain confidence in your abilities and build trust.

* Get involved in the effort to build a microbenchmark framework in glibc.
Siddhesh Poyarekar has started the effort here:
http://sourceware.org/ml/libc-alpha/2013-01/msg00235.html
Your experience would really help. We need an integrated and
reproducible way to measure the performance impact of code changes.

Cheers,
Carlos.

Re: [PATCH v4] faster strlen on x64

Ondřej Bílka
In reply to this post by Richard Henderson
On Wed, Mar 06, 2013 at 02:25:24PM -0800, Richard Henderson wrote:

> On 2013-03-06 13:54, Andreas Jaeger wrote:
> >>+/* Used in linker - use only %xmm8-%xmm15.  */
> >
> >Better expand
> >Say explictely - if it's true: This is used inside the dynamic linker during
> >relocation, thus we cannot touch the usual registers. We can use %xmm8 to %xmm15.
>
> I did already object to using *any* xmm registers in the dynamic linker.
>
> I know for a fact that Intel has a __regcall calling convention in
> place that uses all 16 xmm registers for parameter passing.
>
A press release introducing it is dated 13.2.2013. I moved the rtld change to
a separate patch.
>
> r~


Re: [PATCH v4] faster strlen on x64

Ondřej Bílka
In reply to this post by Andreas Jaeger-8
On Wed, Mar 06, 2013 at 10:54:11PM +0100, Andreas Jaeger wrote:
> On 02/13/2013 12:38 PM, Ondřej Bílka wrote:

> >-# define RETURN  jmp L(StartStrcpyPart)
> >-# include "strlen-sse2-pminub.S"
> >-# undef RETURN
>
> You say that you inline the strlen part in the changes entry but I
> do not see this from the function.
>
> Please add comments here, it's not clear what this code does at all.
>
> Explain at the beginning the algorithm used, and explain what's
> happening in the code.
>
It inlines strlen exactly as stated in the ChangeLog entry. This is done by
taking strlen-sse2-pminub.S as it was and expanding RETURN.

It is a temporary solution to resolve the dependency of strcat on strlen.
The alternative is to keep strlen-sse2-pminub.S and the strcat files.

I delayed a proper strcat patch because it also depends on strcpy, as I
described in the appropriate threads.

snip.

>
> this needs a comment. What is this macro doing?
>
It creates a bitmask in %rdx that has the i-th bit set if byte %rax[i] is
zero (see the C sketch after the quoted macro below).

> >+#define FIND_ZERO \
> >+ pcmpeqb (%rax), %xmm8; \
> >+ pcmpeqb 16(%rax), %xmm9; \
> >+ pcmpeqb 32(%rax), %xmm10; \
> >+ pcmpeqb 48(%rax), %xmm11; \
> >+ pmovmskb %xmm8, %esi; \
> >+ pmovmskb %xmm9, %edx; \
> >+ pmovmskb %xmm10, %r8d; \
> >+ pmovmskb %xmm11, %ecx; \
> >+ salq $16, %rdx; \
> >+ salq $16, %rcx; \
> >+ orq %rsi, %rdx; \
> >+ orq %r8, %rcx; \
> >+ salq $32, %rcx; \
> >+ orq %rcx, %rdx;
> >+
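
To make the intent concrete, a rough C equivalent of what FIND_ZERO computes,
written with SSE2 intrinsics (a sketch only, not the committed code; the
function name is made up):

#include <emmintrin.h>
#include <stdint.h>

/* Sketch: build a 64-bit mask in which bit i is set iff byte p[i] is zero,
   for the 64 bytes at the 16-byte-aligned pointer p.  */
static uint64_t
find_zero (const unsigned char *p)
{
  const __m128i zero = _mm_setzero_si128 ();
  uint64_t m0 = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 ((const __m128i *) (p + 0)), zero));
  uint64_t m1 = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 ((const __m128i *) (p + 16)), zero));
  uint64_t m2 = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 ((const __m128i *) (p + 32)), zero));
  uint64_t m3 = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 ((const __m128i *) (p + 48)), zero));
  return m0 | (m1 << 16) | (m2 << 32) | (m3 << 48);
}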
>
> What is this prolog doing?
It tests whether the end condition has already happened in the bytes just
examined.

>
> >+# define STRNLEN_PROLOG \
> >+ mov %r11, %rsi; \
> >+ subq %rax, %rsi; \
> >+ andq $-64, %rax; \
> >+ testq $-64, %rsi; \
> >+ je L(strnlen_ret)
> >+#else
> >+# define STRNLEN_PROLOG  andq $-64, %rax;
> >+#endif
> >+
>
> And this one? Please document!
It avoids duplication, as the code for the common case and for the
page-crossing case are nearly identical; I do not have a better explanation.
(A rough C sketch of what the prolog computes follows the quoted macro below.)

> >+#define PROLOG(lab) \
> >+ movq %rdi, %rcx; \
> >+ xorq %rax, %rcx; \
> >+ STRNLEN_PROLOG; \
> >+ sarq %cl, %rdx; \
> >+ test %rdx, %rdx; \
> >+ je L(lab); \
> >+ bsfq %rdx, %rax; \
> >+ ret
> >+
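
In C terms the plain-strlen case of this prolog is roughly the following
(a sketch only; names are made up, and the strnlen bound handling is omitted):

#include <stdint.h>

/* Sketch: 'mask' has bit i set iff byte i of the 64-byte block containing s
   is zero (as produced by FIND_ZERO); 'offset' is s minus the block start.
   Discard the bits for bytes before s; if a zero byte remains, its index is
   the length, otherwise continue with the main loop.  */
static long
prolog (uint64_t mask, unsigned int offset, int *done)
{
  mask >>= offset;
  if (mask == 0)
    {
      *done = 0;                  /* no zero byte seen yet, enter the loop */
      return 0;
    }
  *done = 1;
  return __builtin_ctzll (mask);  /* index of first zero byte == strlen */
}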

Rest tomorrow.

Re: Ping: [PATCH v4] faster strlen on x64

Liubov Dmitrieva
In reply to this post by Carlos O'Donell-2
Hello.

I've reproduced the performance measurements using Ondrej's benchmarks on the
IA CPUs that are the most important for us, and can confirm that this patch
is fine.
It seems that this conversation was kept in private messages and didn't get
publicity. I'm sorry for that.

--
Liubov Dmitrieva

2013/3/7 Carlos O'Donell <[hidden email]>:

> [...]

Re: Ping: [PATCH v4] faster strlen on x64

Andreas Jaeger-8
On 03/07/2013 09:38 AM, Dmitrieva Liubov wrote:
> Hello.
>
> I've reproduced performance measurement using Ondrej benchmarks on the
> IA CPUs the most important for us and can confirm that this patch is
> fine.
> It seems that this conversation was kept in private messages and
> didn't get publicity.
> I'm sorry for that.

Thanks for confirming that the patches are a real benefit - now let's
get them in a good shape (readable)...

Andreas
--
  Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
   SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
    GF: Jeff Hawn,Jennifer Guild,Felix Imendörffer,HRB16746 (AG Nürnberg)
     GPG fingerprint = 93A3 365E CE47 B889 DF7F  FED1 389A 563C C272 A126

Re: Ping: [PATCH v4] faster strlen on x64

Ondřej Bílka
On Thu, Mar 07, 2013 at 04:09:27PM +0100, Andreas Jaeger wrote:

> On 03/07/2013 09:38 AM, Dmitrieva Liubov wrote:
> >[...]
>
> Thanks for confirming that the patches are a real benefit - now
> let's get them in a good shape (readable)...
>
> Andreas
Here is the new version, as a standalone patch and as a diff from the previous one.

I made one functional change: I replaced the 64-bit bsf with the 32-bit one,
which is definitely faster.

The rest is added comments.

  * sysdeps/x86_64/strlen.S: Replace with new SSE2 based
  implementation which is faster on all x86_64 architectures.
  Tested on AMD, Intel Nehalem, SNB, IVB.
  * sysdeps/x86_64/strnlen.S: Likewise.

  * sysdeps/x86_64/multiarch/Makefile (sysdep_routines):
  Remove all multiarch strlen and strnlen versions.
  * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Update.
  Remove strlen and strnlen related parts.

  * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Update.
  Inline strlen part.
  * sysdeps/x86_64/multiarch/strcat-ssse3.S: Likewise.
  * sysdeps/x86_64/strcat.S: Add comment.

  * sysdeps/x86_64/multiarch/strlen.S: Remove file.
  * sysdeps/x86_64/multiarch/strlen-sse2-no-bsf.S: Likewise.
  * sysdeps/x86_64/multiarch/strlen-sse2-pminub.S: Likewise.
  * sysdeps/x86_64/multiarch/strlen-sse4.S: Likewise.
  * sysdeps/x86_64/multiarch/strnlen.S: Likewise.
  * sysdeps/x86_64/multiarch/strnlen-sse2-no-bsf.S: Likewise.


Attachment: 0001-Faster-strlen-on-x86-64.patch (47K)
Attachment: comments.patch (4K)

Re: Ping: [PATCH v4] faster strlen on x64

Andreas Jaeger-8
On 03/07/2013 09:37 PM, Ondřej Bílka wrote:

> On Thu, Mar 07, 2013 at 04:09:27PM +0100, Andreas Jaeger wrote:
>> On 03/07/2013 09:38 AM, Dmitrieva Liubov wrote:
>>> [...]
>>
>> Thanks for confirming that the patches are a real benefit - now
>> let's get them in a good shape (readable)...
>>
>> Andreas
>
> Here is new version as standalone and diff from previous.
>
> I did one functional change: replace 64bit bsf by 32bit which is
> definitely faster.
>
> Rest are added comments.
>
>    * sysdeps/x86_64/strlen.S: Replace with new SSE2 based
>    implementation which is faster on all x86_64 architectures.
>    Tested on AMD, Intel Nehalem, SNB, IVB.
>    * sysdeps/x86_64/strnlen.S: Likewise.
>
>    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines):
>    Remove all multiarch strlen and strnlen versions.
>    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Update.
>    Remove strlen and strnlen related parts.

You need to mention the function the changes happen in - and don't say
just "Update".

This would be fine:
sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
Remove strlen and strnlen related parts.

>    * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Update.
>    Inline strlen part.

No need to say Update - it's clear that you update it ;). Just say
"Inline old strlen version."

>    * sysdeps/x86_64/multiarch/strcat-ssse3.S: Likewise.
> * sysdeps/x86_64/strcat.S: Add comment.
>
>    * sysdeps/x86_64/multiarch/strlen.S: Remove file.
>    * sysdeps/x86_64/multiarch/strlen-sse2-no-bsf.S: Likewise.
>    * sysdeps/x86_64/multiarch/strlen-sse2-pminub.S: Likewise.
>    * sysdeps/x86_64/multiarch/strlen-sse4.S: Likewise.
>    * sysdeps/x86_64/multiarch/strnlen.S: Likewise.
>    * sysdeps/x86_64/multiarch/strnlen-sse2-no-bsf.S: Likewise.
>

Please leave the copyright header for sysdeps/x86_64/strnlen.S.


I think this is fine with the above changes.

But I would like to give Carlos the chance to review it. Carlos, do you
want to review it - or shall this go in as is?

thanks,
Andreas
--
  Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
   SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
    GF: Jeff Hawn,Jennifer Guild,Felix Imendörffer,HRB16746 (AG Nürnberg)
     GPG fingerprint = 93A3 365E CE47 B889 DF7F  FED1 389A 563C C272 A126

Re: Ping: [PATCH v4] faster strlen on x64

Ondřej Bílka
Carlos, what is your opinion?
On Fri, Mar 08, 2013 at 07:18:09AM +0100, Andreas Jaeger wrote:
> I think this is fine with the above changes.
>
> But I would like to give Carlos the chance to review it. Carlos, do
> you want to review it - or shall this go in as is?
>
> thanks,
> Andreas
> --



Re: Ping: [PATCH v4] faster strlen on x64

Andreas Jaeger-8
On Monday, March 11, 2013 10:15:27 Ondřej Bílka wrote:
> Carlos, what is you opinion?
>
> On Fri, Mar 08, 2013 at 07:18:09AM +0100, Andreas Jaeger wrote:
> > I think this is fine with the above changes.
> >
> > But I would like to give Carlos the chance to review it. Carlos, do
> > you want to review it - or shall this go in as is?

Ondrej, please submit tomorrow if Carlos has no further comments,

Andreas
--
 Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
  SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
   GF: Jeff Hawn,Jennifer Guild,Felix Imendörffer,HRB16746 (AG Nürnberg)
    GPG fingerprint = 93A3 365E CE47 B889 DF7F  FED1 389A 563C C272 A126
