[PATCH] Fix up strtod_l.c for -O0


[PATCH] Fix up strtod_l.c for -O0

Jakub Jelinek
Hi!

I know that glibc currently refuses to compile with -O0, but e.g.
libquadmath, which copies code from glibc, doesn't enforce that.

As reported in http://gcc.gnu.org/PR56379 , strtod_l.c misbehaves when it is
compiled with -O0.  The problem is that __mpn_lshift_1's
__builtin_constant_p (count) && count == BITS_PER_MP_LIMB
check isn't just an optimization, as the comment claims.  __mpn_lshift_1 is
normally called either with count equal to the constant BITS_PER_MP_LIMB, in
which case it expects to take the inline copy loop, or with a variable count
that is assumed to be > 0 and < BITS_PER_MP_LIMB.
__mpn_lshift_1 is always_inline, but __builtin_constant_p is just an
optimization, so at -O0 it evaluates to 0 whenever the argument isn't already
a constant during parsing (it is only inlining that would turn the argument
into a constant).
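
For illustration, here is a minimal standalone sketch (my own example, not
glibc code, and the helper name is made up) of that behavior with GCC:

  #include <stdio.h>

  /* Mirrors the __mpn_lshift_1 situation: always_inline still forces the
     body to be inlined at -O0, but __builtin_constant_p on the parameter
     is folded to 0 there; at -O1+ the call below normally prints 1.  */
  static inline int __attribute__ ((always_inline))
  count_is_constant (unsigned int count)
  {
    return __builtin_constant_p (count);
  }

  int
  main (void)
  {
    printf ("%d\n", count_is_constant (64));
    return 0;
  }
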
The reason why the else part doesn't handle count == BITS_PER_MP_LIMB
properly is
1) mpn_lshift itself claims:
   Argument constraints:
   1. 0 < CNT < BITS_PER_MP_LIMB
   2. If the result is to be written over the input, WP must be >= UP.
2) even ignoring that, while most of the assembly implementations might
   happen to do the right thing, ptr[0] will usually still hold its
   unmodified original value, so ptr[0] |= limb >> (BITS_PER_MP_LIMB - count);
   performs ptr[0] |= limb; rather than the ptr[0] = limb; that would be
   desirable for that count.
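
A hedged, self-contained illustration of 2) (my own sketch, not glibc code;
it hard-codes 64-bit limbs and assumes __mpn_lshift left ptr[0] untouched)
showing what the else branch computes for that count:

  #include <inttypes.h>
  #include <stdio.h>

  int
  main (void)
  {
    uint64_t old_low = 0x00000000deadbeefULL; /* stale ptr[0] contents */
    uint64_t limb    = 0x0123456789abcdefULL; /* limb to shift in */
    unsigned int count = 64;                  /* == BITS_PER_MP_LIMB */

    /* The else branch does ptr[0] |= limb >> (BITS_PER_MP_LIMB - count),
       i.e. ptr[0] |= limb >> 0, so the stale low bits survive.  */
    uint64_t wrong = old_low | (limb >> (64 - count));
    uint64_t right = limb;  /* what the word-copy case correctly stores */

    printf ("wrong = %016" PRIx64 "\nright = %016" PRIx64 "\n", wrong, right);
    return 0;
  }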

The reason for the __builtin_constant_p check is, I think, that Ulrich didn't
want to bloat the code: while the compiler can sometimes prove with VRP that
a variable count will be > 0 and < BITS_PER_MP_LIMB, it doesn't have to prove
that in every case, and __mpn_lshift_1 would then carry code for both cases
even when we know that only one of them is ever needed at a particular call
site.

So, to fix this, we can either change __mpn_lshift_1 into a macro (partially,
using helper inlines as done in the following patch, or fully, with no inlines
at all), or we can keep the always_inline __mpn_lshift_1 and just use
  if (
#ifdef __OPTIMIZE__
      __builtin_constant_p (count)
      &&
#endif
         count == BITS_PER_MP_LIMB)
    {
...
    }
  else
    {
...
    }
so that for -O1+ builds we'd never have extra code in there, but for
-O0 we would.

So, any preferences on this?  Or perhaps nothing at all, and we'll fix this
in libquadmath independently.

2013-02-19  Jakub Jelinek  <[hidden email]>

        * stdlib/strtod_l.c (__mpn_lshift_1_word, __mpn_lshift_1_var): New
        inline functions.
        (__mpn_lshift_1): Change into macro using the above functions.

diff --git a/stdlib/strtod_l.c b/stdlib/strtod_l.c
index 5959354..58d3743 100644
--- a/stdlib/strtod_l.c
+++ b/stdlib/strtod_l.c
@@ -441,32 +441,46 @@ str_to_mpn (const STRING_TYPE *str, int digcnt, mp_limb_t *n, mp_size_t *nsize,
 }
 
 
-/* Shift {PTR, SIZE} COUNT bits to the left, and fill the vacated bits
-   with the COUNT most significant bits of LIMB.
+/* Helper function for __mpn_lshift_1.  Handle the case of shifting by exactly
+   a word: just copy words, with no actual bit-shifting.  This isn't just
+   an optimization, __mpn_lshift_1_var doesn't handle count == BITS_PER_MP_LIMB
+   properly.  */
+static inline void
+__attribute ((always_inline))
+__mpn_lshift_1_word (mp_limb_t *ptr, mp_size_t size, mp_limb_t limb)
+{
+  mp_size_t i;
+  for (i = size - 1; i > 0; --i)
+    ptr[i] = ptr[i - 1];
+  ptr[0] = limb;
+}
 
-   Tege doesn't like this function so I have to write it here myself. :)
-   --drepper */
+/* Helper function for __mpn_lshift_1.  COUNT must be > 0 and
+   < BITS_PER_MP_LIMB.  */
 static inline void
 __attribute ((always_inline))
-__mpn_lshift_1 (mp_limb_t *ptr, mp_size_t size, unsigned int count,
- mp_limb_t limb)
+__mpn_lshift_1_var (mp_limb_t *ptr, mp_size_t size, unsigned int count,
+    mp_limb_t limb)
 {
-  if (__builtin_constant_p (count) && count == BITS_PER_MP_LIMB)
-    {
-      /* Optimize the case of shifting by exactly a word:
- just copy words, with no actual bit-shifting.  */
-      mp_size_t i;
-      for (i = size - 1; i > 0; --i)
- ptr[i] = ptr[i - 1];
-      ptr[0] = limb;
-    }
-  else
-    {
-      (void) __mpn_lshift (ptr, ptr, size, count);
-      ptr[0] |= limb >> (BITS_PER_MP_LIMB - count);
-    }
+  (void) __mpn_lshift (ptr, ptr, size, count);
+  ptr[0] |= limb >> (BITS_PER_MP_LIMB - count);
 }
 
+/* Shift {PTR, SIZE} COUNT bits to the left, and fill the vacated bits
+   with the COUNT most significant bits of LIMB.
+
+   Tege doesn't like this function so I have to write it here myself. :)
+   --drepper */
+#define __mpn_lshift_1(ptr, size, count, limb) \
+  do \
+    { \
+      if (__builtin_constant_p (count) && count == BITS_PER_MP_LIMB) \
+ __mpn_lshift_1_word (ptr, size, limb); \
+      else \
+ __mpn_lshift_1_var (ptr, size, count, limb); \
+    } \
+  while (0)
+
 
 #define INTERNAL(x) INTERNAL1(x)
 #define INTERNAL1(x) __##x##_internal

        Jakub

Re: [PATCH] Fix up strtod_l.c for -O0

Carlos O'Donell
On 02/19/2013 06:19 AM, Jakub Jelinek wrote:

> Hi!
>
> I know that glibc currently refuses to compile with -O0, but e.g.
> libquadmath that copies code from glibc doesn't enforce that.
>
> As reported in http://gcc.gnu.org/PR56379 , when strtod_l.c is compiled with
> -O0, it misbehaves.  The problem is that __mpn_lshift_1's
> __builtin_constant_p (count) && count == BITS_PER_MP_LIMB
> isn't just an optimization as the comment says.  __mpn_lshift_1
> is normally called either with count equal to constant BITS_PER_MP_LIMB,
> and then it expects to do the inline loop, or with variable count
> that is assumed to be > 0 and < BITS_PER_MP_LIMB.
> __mpn_lshift_1 is always_inline, but __builtin_constant_p is just an
> optimization, so at -O0 it will evaluate to 0 when the argument isn't a
> constant already during the parsing (and only inlining in that case
> changes it into a constant).
> The reason why the else part doesn't handle count == BITS_PER_MP_LIMB
> properly is
> 1) mpn_lshift itself claims:
>    Argument constraints:
>    1. 0 < CNT < BITS_PER_MP_LIMB
>    2. If the result is to be written over the input, WP must be >= UP.
> 2) even ignoring that, by most of the assembly implementations it might
>    do the right thing, still ptr[0] will be usually unmodified original
>    value and ptr[0] |= limb >> (BITS_PER_MP_LIMB - count);
>    performs ptr[0] |= limb; rather than ptr[0] = limb; that would be
>    desirable for that count.
>
> The reason for __builtin_constant_p is I think that Ulrich didn't want to
> bloat the code, while the compiler could sometimes with VRP figure out
> that for variable count it will be > 0 and < BITS_PER_MP_LIMB, it doesn't
> have to figure that out always and __mpn_lshift_1 code would then be bloated
> by handling both cases, even when we know that only one of them is ever
> needed at the particular call site.
>
> So, to fix this, either we can change __mpn_lshift_1 into a macro, either
> partially using helper inlines as done in the following patch, or fully,
> with no inlines at all, or we could keep the always_inline __mpn_lshift_1,
> just use
>   if (
> #ifdef __OPTIMIZE__
>       __builtin_constant_p (count)
>       &&
> #endif
>          count == BITS_PER_MP_LIMB)
>     {
> ...
>     }
>   else
>     {
> ...
>     }
> so that for -O1+ builds we'd never have extra code in there, but for
> -O0 we would.
>
> So, any preferences on this?  Or perhaps nothing at all, and we'll fix this
> in libquadmath independently.

No, we should fix this here.

I think a macro and two helper functions, as you have done below, is
a good solution and keeps the code from bit-rotting.

Using no inlines at all looks like it would result in worse code.

Using conditionals on __OPTIMIZE__ will lead to bit-rot since we don't
currently compile glibc with anything but -O2.

> 2013-02-19  Jakub Jelinek  <[hidden email]>
>
> * stdlib/strtod_l.c (__mpn_lshift_1_word, __mpn_lshift_1_var): New
> inline functions.
> (__mpn_lshift_1): Change into macro using the above functions.

So what does the generated code look like after the change?

Cheers,
Carlos.

Re: [PATCH] Fix up strtod_l.c for -O0

Jakub Jelinek
On Tue, Feb 19, 2013 at 08:02:06AM -0500, Carlos O'Donell wrote:
> No, we should fix this here.
>
> I think a macro and two helper functions, as you have done below, is
> a good solution and keeps the code from bit-rotting.
>
> No inlines looks like it would be worse code.

Why?  A macro vs. an inline function should generally give the same code
quality if the inline function is inlined; perhaps the macro could be slightly
better if early optimizations before inlining made a difference.
But as it is always_inline, it should generally be inlined during early
inlining, so the difference should be minimal.

> Using conditionals on __OPTIMIZE__ will lead to bit-rot since we don't
> currently compile glibc with anything but -O2.
>
> > 2013-02-19  Jakub Jelinek  <[hidden email]>
> >
> > * stdlib/strtod_l.c (__mpn_lshift_1_word, __mpn_lshift_1_var): New
> > inline functions.
> > (__mpn_lshift_1): Change into macro using the above functions.
>
> So what does the generated code look like after the change?

I haven't tried it on glibc, but on libquadmath the only difference the patch
makes at -O2 is in two spots where two instructions were swapped:
-    1b51:      45 89 cf                mov    %r9d,%r15d
-    1b54:      4c 89 64 24 38          mov    %r12,0x38(%rsp)
+    1b51:      4c 89 64 24 38          mov    %r12,0x38(%rsp)
+    1b56:      45 89 cf                mov    %r9d,%r15d
...
-    1bf0:      48 8b 5c 24 30          mov    0x30(%rsp),%rbx
-    1bf5:      4c 8b 64 24 38          mov    0x38(%rsp),%r12
+    1bf0:      4c 8b 64 24 38          mov    0x38(%rsp),%r12
+    1bf5:      48 8b 5c 24 30          mov    0x30(%rsp),%rbx

        Jakub

Re: [PATCH] Fix up strtod_l.c for -O0

Carlos O'Donell
On 02/19/2013 08:18 AM, Jakub Jelinek wrote:

> On Tue, Feb 19, 2013 at 08:02:06AM -0500, Carlos O'Donell wrote:
>> No, we should fix this here.
>>
>> I think a macro and two helper functions, as you have done below, is
>> a good solution and keeps the code from bit-rotting.
>>
>> No inlines looks like it would be worse code.
>
> Why?  macro vs. inline function should be generally the same code quality if
> the inline function is inlined, or perhaps macro could be slightly better if
> early optimizations before inlining would make a difference.
> But as it is always_inline, it generally should be inlined during early
> inlining, so the difference should be minimal.

The macro makes the code available earlier for optimization, that's
all, and I agree that the difference is likely minimal. Thus all things
considered I prefer a macro.

Is there any reason not to use a macro?

>> Using conditionals on __OPTIMIZE__ will lead to bit-rot since we don't
>> currently compile glibc with anything but -O2.
>>
>>> 2013-02-19  Jakub Jelinek  <[hidden email]>
>>>
>>> * stdlib/strtod_l.c (__mpn_lshift_1_word, __mpn_lshift_1_var): New
>>> inline functions.
>>> (__mpn_lshift_1): Change into macro using the above functions.
>>
>> So what does the generated code look like after the change?
>
> Haven't tried it on glibc, but on libquadmath the difference the patch does
> at -O2 is just two spots where two instructions were swapped:
> -    1b51:      45 89 cf                mov    %r9d,%r15d
> -    1b54:      4c 89 64 24 38          mov    %r12,0x38(%rsp)
> +    1b51:      4c 89 64 24 38          mov    %r12,0x38(%rsp)
> +    1b56:      45 89 cf                mov    %r9d,%r15d
> ...
> -    1bf0:      48 8b 5c 24 30          mov    0x30(%rsp),%rbx
> -    1bf5:      4c 8b 64 24 38          mov    0x38(%rsp),%r12
> +    1bf0:      4c 8b 64 24 38          mov    0x38(%rsp),%r12
> +    1bf5:      48 8b 5c 24 30          mov    0x30(%rsp),%rbx

So it's just a little bit of code motion and nothing else; this looks good
to me then.

Could you confirm that you see the same thing with a glibc build and
then check in your changes on Wednesday (which gives another day of review)?

Cheers,
Carlos.


Re: [PATCH] Fix up strtod_l.c for -O0

Carlos O'Donell
On 02/19/2013 09:27 AM, Carlos O'Donell wrote:

> On 02/19/2013 08:18 AM, Jakub Jelinek wrote:
>> On Tue, Feb 19, 2013 at 08:02:06AM -0500, Carlos O'Donell wrote:
>>> No, we should fix this here.
>>>
>>> I think a macro and two helper functions, as you have done below, is
>>> a good solution and keeps the code from bit-rotting.
>>>
>>> No inlines looks like it would be worse code.
>>
>> Why?  macro vs. inline function should be generally the same code quality if
>> the inline function is inlined, or perhaps macro could be slightly better if
>> early optimizations before inlining would make a difference.
>> But as it is always_inline, it generally should be inlined during early
>> inlining, so the difference should be minimal.
>
> The macro makes the code available earlier for optimization, that's
> all, and I agree that the difference is likely minimal. Thus all things
> considered I prefer a macro.
>
> Is there any reason not to use a macro?

Just to clarify:

* The number one goal is to avoid any serious performance regressions.

* The number two goal is to avoid divergence between copied sources.

* I'm OK with a macro and two inline helper functions as implemented
  in your patch. I'm happy to see that your testing with libquadmath
  shows that this likely produces no measurable performance difference
  in the generated code. Before committing this patch I'd like you to
  verify that this is also the case for glibc.

* I have a slight preference for making the entire function into a macro.
  However, I have no evidence that this would generate higher performing
  code in this case.

Cheers,
Carlos.

Re: [PATCH] Fix up strtod_l.c for -O0

Jakub Jelinek
On Tue, Feb 19, 2013 at 09:50:20AM -0500, Carlos O'Donell wrote:
> * I have a slight preference for making the entire function into a macro.
>   However, I have no evidence that this would generate higher performing
>   code in this case.

Here it is as a macro without inline helpers.

If I remove the line
  Implemented as a macro, so that __builtin_constant_p works even at -O0.
and the following empty line from the comment, so that the line numbers match,
then the only difference in the whole libc.so on x86_64 is:
-   3fa9d: 40 3a 33             cmp    (%rbx),%sil
+   3fa9d: 40 38 33             cmp    %sil,(%rbx)
in a single spot.  Of course, with the two added lines there are various
differences, because the __assert_fail arguments differ (I wasn't doing a
non-debug build).
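
For context on those __assert_fail differences: assert captures __FILE__ and
__LINE__, so inserting two lines shifts the line number baked into every later
assertion.  A minimal sketch of my own, not the glibc sources:

  #include <assert.h>

  int
  main (void)
  {
    int limbs = 2;
    /* Expands to roughly
       (limbs > 0) ? (void) 0
                   : __assert_fail ("limbs > 0", __FILE__, __LINE__, ...);
       so the emitted arguments depend on which line this sits on.  */
    assert (limbs > 0);
    return 0;
  }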

2013-02-19  Jakub Jelinek  <[hidden email]>

        * stdlib/strtod_l.c (__mpn_lshift_1): Rewritten as function-like
        macro.

diff --git a/stdlib/strtod_l.c b/stdlib/strtod_l.c
index 5959354..47247b5 100644
--- a/stdlib/strtod_l.c
+++ b/stdlib/strtod_l.c
@@ -444,28 +444,30 @@ str_to_mpn (const STRING_TYPE *str, int digcnt, mp_limb_t *n, mp_size_t *nsize,
 /* Shift {PTR, SIZE} COUNT bits to the left, and fill the vacated bits
    with the COUNT most significant bits of LIMB.
 
-   Tege doesn't like this function so I have to write it here myself. :)
+   Implemented as a macro, so that __builtin_constant_p works even at -O0.
+
+   Tege doesn't like this macro so I have to write it here myself. :)
    --drepper */
-static inline void
-__attribute ((always_inline))
-__mpn_lshift_1 (mp_limb_t *ptr, mp_size_t size, unsigned int count,
- mp_limb_t limb)
-{
-  if (__builtin_constant_p (count) && count == BITS_PER_MP_LIMB)
-    {
-      /* Optimize the case of shifting by exactly a word:
- just copy words, with no actual bit-shifting.  */
-      mp_size_t i;
-      for (i = size - 1; i > 0; --i)
- ptr[i] = ptr[i - 1];
-      ptr[0] = limb;
-    }
-  else
-    {
-      (void) __mpn_lshift (ptr, ptr, size, count);
-      ptr[0] |= limb >> (BITS_PER_MP_LIMB - count);
-    }
-}
+#define __mpn_lshift_1(ptr, size, count, limb) \
+  do \
+    { \
+      mp_limb_t *__ptr = (ptr); \
+      if (__builtin_constant_p (count) && count == BITS_PER_MP_LIMB) \
+ { \
+  mp_size_t i; \
+  for (i = (size) - 1; i > 0; --i) \
+    __ptr[i] = __ptr[i - 1]; \
+  __ptr[0] = (limb); \
+ } \
+      else \
+ { \
+  /* We assume count > 0 && count < BITS_PER_MP_LIMB here.  */ \
+  unsigned int __count = (count); \
+  (void) __mpn_lshift (__ptr, __ptr, size, __count); \
+  __ptr[0] |= (limb) >> (BITS_PER_MP_LIMB - __count); \
+ } \
+    } \
+  while (0)
 
 
 #define INTERNAL(x) INTERNAL1(x)


        Jakub

Re: [PATCH] Fix up strtod_l.c for -O0

Carlos O'Donell
On 02/19/2013 11:13 AM, Jakub Jelinek wrote:

> On Tue, Feb 19, 2013 at 09:50:20AM -0500, Carlos O'Donell wrote:
>> * I have a slight preference for making the entire function into a macro.
>>   However, I have no evidence that this would generate higher performing
>>   code in this case.
>
> Here it is as a macro without inline helpers.
>
> If I remove the
> Implemented as a macro, so that __builtin_constant_p works even at -O0.
> and following empty line in comment, so that line numbers match, then in
> the whole libc.so the difference is on x86_64 just:
> -   3fa9d: 40 3a 33             cmp    (%rbx),%sil
> +   3fa9d: 40 38 33             cmp    %sil,(%rbx)
> in a single spot.  Of course with the added two lines there are various
> differences, because __assert_fail arguments differ (wasn't doing non-debug
> build).
>
> 2013-02-19  Jakub Jelinek  <[hidden email]>
>
> * stdlib/strtod_l.c (__mpn_lshift_1): Rewritten as function-like
> macro.
>
> diff --git a/stdlib/strtod_l.c b/stdlib/strtod_l.c
> index 5959354..47247b5 100644
> --- a/stdlib/strtod_l.c
> +++ b/stdlib/strtod_l.c
> @@ -444,28 +444,30 @@ str_to_mpn (const STRING_TYPE *str, int digcnt, mp_limb_t *n, mp_size_t *nsize,
>  /* Shift {PTR, SIZE} COUNT bits to the left, and fill the vacated bits
>     with the COUNT most significant bits of LIMB.
>  
> -   Tege doesn't like this function so I have to write it here myself. :)
> +   Implemented as a macro, so that __builtin_constant_p works even at -O0.
> +
> +   Tege doesn't like this macro so I have to write it here myself. :)
>     --drepper */
> -static inline void
> -__attribute ((always_inline))
> -__mpn_lshift_1 (mp_limb_t *ptr, mp_size_t size, unsigned int count,
> - mp_limb_t limb)
> -{
> -  if (__builtin_constant_p (count) && count == BITS_PER_MP_LIMB)
> -    {
> -      /* Optimize the case of shifting by exactly a word:
> - just copy words, with no actual bit-shifting.  */
> -      mp_size_t i;
> -      for (i = size - 1; i > 0; --i)
> - ptr[i] = ptr[i - 1];
> -      ptr[0] = limb;
> -    }
> -  else
> -    {
> -      (void) __mpn_lshift (ptr, ptr, size, count);
> -      ptr[0] |= limb >> (BITS_PER_MP_LIMB - count);
> -    }
> -}
> +#define __mpn_lshift_1(ptr, size, count, limb) \
> +  do \
> +    { \
> +      mp_limb_t *__ptr = (ptr); \
> +      if (__builtin_constant_p (count) && count == BITS_PER_MP_LIMB) \
> + { \
> +  mp_size_t i; \
> +  for (i = (size) - 1; i > 0; --i) \
> +    __ptr[i] = __ptr[i - 1]; \
> +  __ptr[0] = (limb); \
> + } \
> +      else \
> + { \
> +  /* We assume count > 0 && count < BITS_PER_MP_LIMB here.  */ \
> +  unsigned int __count = (count); \
> +  (void) __mpn_lshift (__ptr, __ptr, size, __count); \
> +  __ptr[0] |= (limb) >> (BITS_PER_MP_LIMB - __count); \
> + } \
> +    } \
> +  while (0)
>  
>  
>  #define INTERNAL(x) INTERNAL1(x)

This looks perfect and I don't think anyone can argue against this.

Please check this in.

Cheers,
Carlos.