[Patch, mips] Faster strcmp for mips


[Patch, mips] Faster strcmp for mips

Steve Ellcey -2

I have created an optimized strcmp routine for MIPS and would like to check
it in.  I verified it by running 'make check' and I also ran 'make bench'
to check the performance.  This function is slightly slower than the current
generic strcmp for strings that are not aligned, but much faster when strings
are aligned because it will then load and compare either 2, 4, or 8 bytes at a
time.

This means it could be loading bytes beyond the end of the strings being
compared but it looks like other architecture specific strcmp functions
are also doing this optimization and the newlib version of strcmp also does
this.
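
For background, the word-at-a-time comparison described above hinges on
detecting a NUL byte inside a loaded word without examining each byte
individually. A minimal sketch of the classic carry-based detector (the macro
name is illustrative, not taken from the patch):

```c
#include <stdint.h>

/* Nonzero iff some byte of w is 0: subtracting 1 from each byte
   borrows through a zero byte and sets its high bit, which the
   final mask extracts.  This is the kind of per-word NUL test a
   2/4/8-byte-at-a-time strcmp relies on.  */
#define HAS_ZERO_BYTE(w) \
  ((((w) - 0x0101010101010101ULL) & ~(w)) & 0x8080808080808080ULL)
```

With this test, a word whose bytes are all non-NUL and equal in both strings
can be skipped in a single comparison instead of eight.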

OK to checkin?

I have attached the original and new bench-strcmp.out files in case anyone
wants to compare the old and new strcmp performance results.

Steve Ellcey
[hidden email]


2013-11-14  Steve Ellcey  <[hidden email]>

        * sysdeps/mips/strcmp.c: New.



Attachments: strcmp.patch (5K), bench-strcmp.out.orig (11K), bench-strcmp.out.new (11K)

Re: [Patch, mips] Faster strcmp for mips

Ondřej Bílka
On Thu, Nov 14, 2013 at 01:23:41PM -0800, Steve Ellcey wrote:
>
> I have created an optimized strcmp routine for MIPS and would like to check
> it in.  I verified it by running 'make check' and I also ran 'make bench'
> to check the performance.  This function is slightly slower then the current
> generic strcmp for strings that are not aligned, but much faster when strings
> are aligned because it will then load and compare either 2, 4, or 8 bytes at a
> time.
>
The problem is that aligned accesses may be only a third of total calls. Also,
around 90% of calls need to check only the first 16 bytes. These numbers differ
by application; on x64 they are the rule rather than the exception. I collected
data from running gcc+gnuplot here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcmp_profile/results_gcc/result.html 

> This means it could be loading bytes beyond the end of the strings being
> compared but it looks like other architecture specific strcmp functions
> are also doing this optimization and the newlib version of strcmp also does
> this.
>
Also, the unaligned case can be made faster with a bit of extra effort.
How fast are unaligned loads on MIPS? They should be faster than
assembling the value manually by shifting two aligned loads, which in
turn should be faster than byte-by-byte checking.

What is most important for performance is handling the first bytes quickly.
You need to check several alternatives: one is to unroll the checking of the
first 8-16 bytes; a second is to do a single vector check.

Then you need to implement a loop; what I did on x64 is align one
argument and load the second as unaligned.

You need to precompute when an unaligned access would cross a page boundary
and switch to a byte-by-byte loop when that happens.

Here is modified code that I used to generate the x64 implementation; I hope
it will be useful.

You need to change TEST (a, b) to code that checks 8 bytes starting at a
and b.

Also you may try to unroll these loops to do 16/32 bytes per iteration.

int
strcmp_new (char *a, char *b)
{
  int i;
  /* The OR of the two page offsets is <= 4096 - 8 only if each pointer
     is at least 8 bytes from the end of its page, so an 8-byte load
     from either cannot cross a page boundary.  */
  int page_off = (((unsigned long) a) | ((unsigned long) b)) & 4095;
  if (page_off <= 4096 - 8)
    {
      TEST (a, b);
    }
  else
    {
      for (i = 0; i < 8; i++)
        {
          if (!a[i] || (a[i] - b[i]))
            return a[i] - b[i];
        }
    }

  /* Align a to 8 bytes; b is then loaded unaligned.  */
  char *a_al = ALIGN_DOWN (a + 8, 8);
  int dif = a_al - a;
  a += dif;
  b += dif;
  /* Iterations remaining before a load of b reaches the next page.  */
  unsigned long b_next = ((unsigned long) b) & (4096 - 1);
  unsigned long next_page = (4096 - b_next) / 8;

  while (1)
    {
      if (__builtin_expect (next_page-- == 0, 0))
        {
          next_page = 4096 / 8;
          next_page--;
          /* Crossing into a new page: check these 8 bytes one at a
             time before resuming word-sized loads.  */
          for (i = 0; i < 8; i++)
            {
              if (!a[i] || (a[i] - b[i]))
                return a[i] - b[i];
            }
        }
      TEST (a, b);
      a += 8;
      b += 8;
    }
}

Also here

> +  bx.v = x; by.v = y;
> +  if (bx.b.B0 - by.b.B0 != 0 || bx.b.B0 == 0)
> +    return bx.b.B0 - by.b.B0;
> +  if (bx.b.B1 - by.b.B1 != 0 || bx.b.B1 == 0)
> +    return bx.b.B1 - by.b.B1;
> +  if (bx.b.B2 - by.b.B2 != 0 || bx.b.B2 == 0)
> +    return bx.b.B2 - by.b.B2;
> +  if (bx.b.B3 - by.b.B3 != 0 || bx.b.B3 == 0)
> +    return bx.b.B3 - by.b.B3;
> +  if (bx.b.B4 - by.b.B4 != 0 || bx.b.B4 == 0)
> +    return bx.b.B4 - by.b.B4;
> +  if (bx.b.B5 - by.b.B5 != 0 || bx.b.B5 == 0)
> +    return bx.b.B5 - by.b.B5;
> +  if (bx.b.B6 - by.b.B6 != 0 || bx.b.B6 == 0)
> +    return bx.b.B6 - by.b.B6;
> +  return bx.b.B7 - by.b.B7;

There is an alternative way to find the answer without branching.
You compute a number that has a nonzero i-th byte if the i-th bytes of
x and y differ or are 0, then find the first nonzero bit, which can be
done quickly with a CLZ instruction.

This could be implemented as follows:

uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
uint64_t bitfirst = bitmask ^ (bitmask - 1);
int pos = (ffsl(bitfirst) - 1) / 8;
return a[pos] - b[pos];

Re: [Patch, mips] Faster strcmp for mips

Steve Ellcey -2
On Fri, 2013-11-15 at 00:14 +0100, Ondřej Bílka wrote:

> Problem is that aligned access could be only third of total calls. Also
> around 90% calls need to check only first 16 bytes. These numbers differ
> by application on x64 they are rule rather than exception. I collected
> data of running gcc+gnuplot here:
> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcmp_profile/results_gcc/result.html 

Very interesting; you have obviously spent a lot more time than I have
on strcmp.

> > This means it could be loading bytes beyond the end of the strings being
> > compared but it looks like other architecture specific strcmp functions
> > are also doing this optimization and the newlib version of strcmp also does
> > this.
> >
> Also unaligned case can be made faster with bit of extra effort.
> How fast are unaligned loads on MIPS? They should be faster than
> assembling value by shifts from two loads manualy which should be faster
> than byte-by-byte checking.

MIPS doesn't have unaligned loads, but it has 'partial' loads where two
loads can give you four (or eight) bytes in two load operations
regardless of alignment.  I may need to look at this some more, though
that will probably mean going to assembly language instead of C.


> Also here
> > +  bx.v = x; by.v = y;
> > +  if (bx.b.B0 - by.b.B0 != 0 || bx.b.B0 == 0)
> > +    return bx.b.B0 - by.b.B0;
> > +  if (bx.b.B1 - by.b.B1 != 0 || bx.b.B1 == 0)
> > +    return bx.b.B1 - by.b.B1;
> > +  if (bx.b.B2 - by.b.B2 != 0 || bx.b.B2 == 0)
> > +    return bx.b.B2 - by.b.B2;
> > +  if (bx.b.B3 - by.b.B3 != 0 || bx.b.B3 == 0)
> > +    return bx.b.B3 - by.b.B3;
> > +  if (bx.b.B4 - by.b.B4 != 0 || bx.b.B4 == 0)
> > +    return bx.b.B4 - by.b.B4;
> > +  if (bx.b.B5 - by.b.B5 != 0 || bx.b.B5 == 0)
> > +    return bx.b.B5 - by.b.B5;
> > +  if (bx.b.B6 - by.b.B6 != 0 || bx.b.B6 == 0)
> > +    return bx.b.B6 - by.b.B6;
> > +  return bx.b.B7 - by.b.B7;
>
> There is alternative way to find answer without branching.
> You need to compute a number that has nonzero i-th byte if i-th bytes of
> x and y differ or they are 0. Then find first nonzero bit which can be
> done quickly by CLZ instruction.
>
> This could be implemented as following
>
> uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
> uint64_t bitfirst = bitmask ^ (bitmask - 1);
> int pos = (ffsl(bitfirst) - 1) / 8;
> return a[pos] - b[pos];

This looks very interesting, I tried just plugging it in but it didn't
work in some cases.  I need to stare at this some more to understand
exactly how it should work and what cases are failing for me.

Steve Ellcey
[hidden email]



Re: [Patch, mips] Faster strcmp for mips

Carlos O'Donell-2
In reply to this post by Steve Ellcey -2
On Thu, Nov 14, 2013 at 4:23 PM, Steve Ellcey <[hidden email]> wrote:
> This means it could be loading bytes beyond the end of the strings being
> compared but it looks like other architecture specific strcmp functions
> are also doing this optimization and the newlib version of strcmp also does
> this.

I thought that doing so was dangerous? I'm pretty sure we've been trying
to fix such "load bytes beyond the end of the string" issues because you
could have a string that straddles a page boundary with the next page
unmapped and such an optimized routine would fault on a read from the
unmapped page.

How do you plan to fix that?

Cheers,
Carlos.

Re: [Patch, mips] Faster strcmp for mips

Steve Ellcey -2
On Fri, 2013-11-15 at 13:47 -0500, Carlos O'Donell wrote:

> On Thu, Nov 14, 2013 at 4:23 PM, Steve Ellcey <[hidden email]> wrote:
> > This means it could be loading bytes beyond the end of the strings being
> > compared but it looks like other architecture specific strcmp functions
> > are also doing this optimization and the newlib version of strcmp also does
> > this.
>
> I thought that doing so was dangerous? I'm pretty sure we've been trying
> to fix such "load bytes beyond the end of the string" issues because you
> could have a string that straddles a page boundary with the next page
> unmapped and such an optimized routine would fault on a read from the
> unmapped page.
>
> How do you plan to fix that?
>
> Cheers,
> Carlos.

I think the assumption has been that if an aligned word contains at
least one byte of the string, it is OK to read that entire word.  I
don't think a single aligned word (4 or 8 bytes depending on the
architecture) can straddle a page boundary.  If the word was unaligned,
then I think it could straddle a page boundary and you could get into
trouble.
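
The argument above can be checked with one line of arithmetic: for a 4096-byte
page, a naturally aligned 8-byte load starts and ends in the same page. A tiny
illustrative check (this helper is mine, not from the thread):

```c
#include <stdint.h>

/* True iff an 8-byte load at addr stays within one 4096-byte page,
   i.e. the first and last byte of the load share a page index.  */
static int
load_stays_in_page (uintptr_t addr)
{
  return (addr / 4096) == ((addr + 7) / 4096);
}
```

Every 8-byte-aligned address passes this check (4096 is a multiple of 8),
while, for example, address 4090 fails it, which is exactly the unaligned case
described as trouble above.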

Steve Ellcey
[hidden email]



Re: [Patch, mips] Faster strcmp for mips

Carlos O'Donell-6
On 11/15/2013 01:52 PM, Steve Ellcey wrote:

> On Fri, 2013-11-15 at 13:47 -0500, Carlos O'Donell wrote:
>> On Thu, Nov 14, 2013 at 4:23 PM, Steve Ellcey <[hidden email]> wrote:
>>> This means it could be loading bytes beyond the end of the strings being
>>> compared but it looks like other architecture specific strcmp functions
>>> are also doing this optimization and the newlib version of strcmp also does
>>> this.
>>
>> I thought that doing so was dangerous? I'm pretty sure we've been trying
>> to fix such "load bytes beyond the end of the string" issues because you
>> could have a string that straddles a page boundary with the next page
>> unmapped and such an optimized routine would fault on a read from the
>> unmapped page.
>>
>> How do you plan to fix that?
>>
>> Cheers,
>> Carlos.
>
> I think the assumption has been that if an aligned word contains at
> least one byte of the string, it is OK to read that entire word.  I
> don't think a single aligned word (4 or 8 bytes depending on the
> architecture) can straddle a page boundary.  If the word was unaligned,
> then I think it could straddle a page boundary and you could get into
> trouble.

That is obviously correct, because the aligned word itself wouldn't
cross the page boundary. My only comment here is that reading
past the end of the string is a dangerous operation unless you take
care to ensure it doesn't cross a page boundary. Given that you've
clarified that it's only ever an aligned word load, which can't
possibly cross a page boundary, you're safe.

Sorry for the noise.

Cheers,
Carlos.

Re: [Patch, mips] Faster strcmp for mips

Ondřej Bílka
In reply to this post by Steve Ellcey -2
On Fri, Nov 15, 2013 at 10:20:04AM -0800, Steve Ellcey wrote:

> On Fri, 2013-11-15 at 00:14 +0100, Ondřej Bílka wrote:
>
> > Problem is that aligned access could be only third of total calls. Also
> > around 90% calls need to check only first 16 bytes. These numbers differ
> > by application on x64 they are rule rather than exception. I collected
> > data of running gcc+gnuplot here:
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcmp_profile/results_gcc/result.html 
>
> Very interesting, you have obviously spent a lot more time then I have
> on strcmp.
>
> > > This means it could be loading bytes beyond the end of the strings being
> > > compared but it looks like other architecture specific strcmp functions
> > > are also doing this optimization and the newlib version of strcmp also does
> > > this.
> > >
> > Also unaligned case can be made faster with bit of extra effort.
> > How fast are unaligned loads on MIPS? They should be faster than
> > assembling value by shifts from two loads manualy which should be faster
> > than byte-by-byte checking.
>
> MIPS doesn't have unaligned loads but it has 'partial' loads where two
> loads can give you four (or eight) bytes in two load operations
> regardless of alignment.  I may need to look at this some more though
> that will probably mean going to assembly language instead of C.
>
You can also first try this in C by using shifts.
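
To make that concrete, here is a little-endian C sketch of assembling an
unaligned 8-byte value from the two aligned words that contain it, using only
shifts (the function name and details are mine; a real implementation would
also have to respect strict aliasing):

```c
#include <stdint.h>

/* Little-endian sketch: fetch 8 bytes from a possibly unaligned p by
   combining the two aligned 8-byte words that cover them.  When p is
   already aligned a single load suffices; otherwise the low bytes come
   from the first word and the high bytes from the second.  */
static uint64_t
load_unaligned_le (const unsigned char *p)
{
  uintptr_t off = (uintptr_t) p & 7;
  const uint64_t *w = (const uint64_t *) ((uintptr_t) p - off);
  if (off == 0)
    return w[0];
  return (w[0] >> (8 * off)) | (w[1] << (8 * (8 - off)));
}
```

This is the pure-C analogue of the MIPS 'partial load' pair mentioned above;
on a big-endian target the shift directions swap.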

>
> > Also here
> > > +  bx.v = x; by.v = y;
> > > +  if (bx.b.B0 - by.b.B0 != 0 || bx.b.B0 == 0)
> > > +    return bx.b.B0 - by.b.B0;
> > > +  if (bx.b.B1 - by.b.B1 != 0 || bx.b.B1 == 0)
> > > +    return bx.b.B1 - by.b.B1;
> > > +  if (bx.b.B2 - by.b.B2 != 0 || bx.b.B2 == 0)
> > > +    return bx.b.B2 - by.b.B2;
> > > +  if (bx.b.B3 - by.b.B3 != 0 || bx.b.B3 == 0)
> > > +    return bx.b.B3 - by.b.B3;
> > > +  if (bx.b.B4 - by.b.B4 != 0 || bx.b.B4 == 0)
> > > +    return bx.b.B4 - by.b.B4;
> > > +  if (bx.b.B5 - by.b.B5 != 0 || bx.b.B5 == 0)
> > > +    return bx.b.B5 - by.b.B5;
> > > +  if (bx.b.B6 - by.b.B6 != 0 || bx.b.B6 == 0)
> > > +    return bx.b.B6 - by.b.B6;
> > > +  return bx.b.B7 - by.b.B7;
> >
> > There is alternative way to find answer without branching.
> > You need to compute a number that has nonzero i-th byte if i-th bytes of
> > x and y differ or they are 0. Then find first nonzero bit which can be
> > done quickly by CLZ instruction.
> >
> > This could be implemented as following
> >
> > uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
> > uint64_t bitfirst = bitmask ^ (bitmask - 1);
> > int pos = (ffsl(bitfirst) - 1) / 8;
> > return a[pos] - b[pos];
>
> This looks very interesting, I tried just plugging it in but it didn't
> work in some cases.  I need to stare at this some more to understand
> exactly how it should work and what cases are failing for me.
>
I wrote that late at night, and my first idea was to use __builtin_clz,
which would yield the following code.

 uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
 uint64_t bitfirst = bitmask ^ (bitmask - 1);
 int pos = (63 - __builtin_clzl(bitfirst)) / 8;
 return a[pos] - b[pos];

I decided that using ffsl was shorter, but for some reason I kept
bitfirst there. A correct version is:

 uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
 int pos = (ffsl(bitmask) - 1) / 8;
 return a[pos] - b[pos];
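
Filled out into a self-contained little-endian sketch (DETECTNULL8 is assumed
to be the usual carry-based zero-byte detector, and the function name is
illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Nonzero iff some byte of w is 0; stands in for whatever the patch
   defines as DETECTNULL8.  */
#define DETECTNULL8(w) \
  ((((w) - 0x0101010101010101ULL) & ~(w)) & 0x8080808080808080ULL)

/* Little-endian only: given 8 bytes at a and b, return the strcmp-style
   result for the first differing or NUL byte.  The caller must already
   know such a byte exists in this word.  ffsll finds the lowest set
   bit, i.e. the lowest-addressed marked byte.  */
static int
final_cmp_le (const unsigned char *a, const unsigned char *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);
  memcpy (&y, b, 8);
  uint64_t bitmask = DETECTNULL8 (x) | (x ^ y);
  int pos = (__builtin_ffsll (bitmask) - 1) / 8;
  return a[pos] - b[pos];
}
```

The borrow in DETECTNULL8 can falsely mark bytes above a real zero byte, but
since ffsll scans from the bottom, the first genuine NUL or difference always
wins; this is why the construction is safe on little-endian.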


Re: [Patch, mips] Faster strcmp for mips

Steve Ellcey -2
On Fri, 2013-11-15 at 20:02 +0100, Ondřej Bílka wrote:

> I decided that using ffls was shorter but for some reasons I kept
> bitfirst there. A correct version is
>
>  uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
>  int pos = (ffsl(bitmask) - 1) / 8;
>  return a[pos] - b[pos];

Yes, that works much better.  But it only works in little-endian mode.  I
think I would need fls (find last set) or something similar for
big-endian, wouldn't I?  Or else I would need to swap the bytes around
before using ffs/ffsl.

Steve Ellcey
[hidden email]




Re: [Patch, mips] Faster strcmp for mips

Ondřej Bílka
On Mon, Nov 18, 2013 at 03:37:58PM -0800, Steve Ellcey wrote:

> On Fri, 2013-11-15 at 20:02 +0100, Ondřej Bílka wrote:
>
> > I decided that using ffls was shorter but for some reasons I kept
> > bitfirst there. A correct version is
> >
> >  uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
> >  int pos = (ffsl(bitmask) - 1) / 8;
> >  return a[pos] - b[pos];
>
> Yes, that works much better.  But it only works in little-endian mode. I
> think I would need a fls (find last set) or something similar for
> big-endian wouldn't I?  Or else I would need to swap the bytes around
> before using ffs/ffsl.
>
Yes, the correct function is __builtin_clzl. The difference from ffs is that
when you pass zero the result is undefined, which should not be a problem here.

There are more builtins here:
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
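
A sketch of the big-endian direction (my code, not from the thread): on
big-endian the first string byte is the most significant byte of the word, so
the first marked byte is found from the top with __builtin_clzll. Note the
zero-byte detector here is the carry-free exact variant — the subtract-borrow
trick can falsely mark bytes more significant than a real NUL, which ffs
tolerates on little-endian but clz would not:

```c
#include <stdint.h>

/* Exact zero-byte mask: 0x80 in byte i iff byte i of w is 0.  The
   per-byte add of 0x7f cannot carry across byte lanes, so there are
   no false marks, unlike the subtract-borrow detector.  */
#define ZERO_BYTES(w) \
  (~((((w) & 0x7f7f7f7f7f7f7f7fULL) + 0x7f7f7f7f7f7f7f7fULL) \
     | (w) | 0x7f7f7f7f7f7f7f7fULL))

/* Big-endian sketch: index (0 = first string byte) of the first byte
   where x and y differ or x has a NUL.  The caller must ensure such a
   byte exists, since clz of 0 is undefined.  */
static int
first_marked_byte_be (uint64_t x, uint64_t y)
{
  uint64_t bitmask = ZERO_BYTES (x) | (x ^ y);
  return __builtin_clzll (bitmask) / 8;
}
```

On a big-endian MIPS this index would then be used exactly as pos is used in
the little-endian version above.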


Re: [Patch, mips] Faster strcmp for mips

Matt Turner-3
In reply to this post by Steve Ellcey -2
On Thu, Nov 14, 2013 at 1:23 PM, Steve Ellcey <[hidden email]> wrote:

>
> I have created an optimized strcmp routine for MIPS and would like to check
> it in.  I verified it by running 'make check' and I also ran 'make bench'
> to check the performance.  This function is slightly slower then the current
> generic strcmp for strings that are not aligned, but much faster when strings
> are aligned because it will then load and compare either 2, 4, or 8 bytes at a
> time.
>
> This means it could be loading bytes beyond the end of the strings being
> compared but it looks like other architecture specific strcmp functions
> are also doing this optimization and the newlib version of strcmp also does
> this.
>
> OK to checkin?
>
> I have attached the original and new bench-strcmp.out files in case anyone
> wants to compare the old and new strcmp performance results.
>
> Steve Ellcey
> [hidden email]
>
>
> 2013-11-14  Steve Ellcey  <[hidden email]>
>
>         * sysdeps/mips/strcmp.c: New.
>
>

Doesn't seem that this was ever committed?

Re: [Patch, mips] Faster strcmp for mips

Steve Ellcey -2
On Wed, 2015-02-18 at 23:24 -0800, Matt Turner wrote:

> > 2013-11-14  Steve Ellcey  <[hidden email]>
> >
> >         * sysdeps/mips/strcmp.c: New.
> >
> >
>
> Doesn't seem that this was ever committed?

That specific patch wasn't checked in, but a later patch was checked in
with an assembly language version of MIPS strcmp instead.

From glibc ChangeLog:


2014-10-01  Steve Ellcey  <[hidden email]>

        * sysdeps/mips/strcmp.S: New.


Steve Ellcey
[hidden email]

P.S. The libc-ports mailing list is no longer being used; the libc-alpha
or libc-help mailing lists should be used instead.