Variations of memset()

Variations of memset()

Matthew Wilcox-2

I would like to propose adding some useful variations of memset()
to libc, and potentially to ISO C.  My apologies if these have been
proposed before; I have not been able to find any evidence of them being
considered previously.

Proposal
--------

Add the following 8 functions to libc, the first three as independent
functions and the others as convenience aliases for them:

void *memset16(uint16_t *s, uint16_t v, size_t count);
void *memset32(uint32_t *s, uint32_t v, size_t count);
void *memset64(uint64_t *s, uint64_t v, size_t count);
void *memset_s(unsigned short *p, unsigned short v, size_t count);
void *memset_i(unsigned int *p, unsigned int v, size_t count);
void *memset_l(unsigned long *p, unsigned long v, size_t count);
void *memset_ll(unsigned long long *p, unsigned long long v, size_t count);
void *memset_p(void **p, void *v, size_t count);

Each of these routines is functionally equivalent to:

{
        size_t i;
        for (i = 0; i < count; i++)
                p[i] = v;
        return p;
}
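
For illustration, here is a usage sketch (hypothetical: none of these
declarations exists in any header yet, and the constants are made up):

#include <stdint.h>
#include <stddef.h>

/* Assumes the proposed prototypes are visible, e.g. from <string.h>. */
void example(uint16_t *cells, void **slots, void *sentinel)
{
        /* Blank 80 two-byte screen cells in a single call. */
        memset16(cells, 0x0720, 80);

        /* Point 128 table slots at the same object. */
        memset_p(slots, sentinel, 128);
}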

Justification
-------------

1. The above for-loop is a relatively common idiom in existing programs
today.  Experiments show approximately a doubling in performance after
replacing the naive implementation above with hand-optimised assembler
for ARM, x86 and PowerPC.  Other architectures are likely to see similar
performance gains.

2. Searching the web for 'memset16' or 'memset32' shows a number of people
asking for implementations, constructing implementations (sometimes
badly) and using implementations across a wide variety of programs.
Having these functions as part of the standard library would give people
a single place to focus their efforts.

3. Making this part of the standard library would give compiler writers
a common idiom to transform the above for-loop into, improving code which
wasn't converted to use one of the memset variations.

4. C++ has a std::fill function to allow assignment to a range.  This
would be the equivalent functionality in C for base types.

Not proposed
------------

A more generic:
        memfill(void *buf, size_t bufsize, const void *pat, size_t patsize)

has previously been proposed by Lars Wirzenius and would provide more
functionality than the independent routines above.  My research indicates
that this is awkward to use; frequently the value to be assigned is
constructed in-place, and assigning it to a temporary variable in order
to take the address and pass it to memfill() is clumsy.
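
To illustrate the clumsiness, consider the following sketch (memfill()
and memset32() are the hypothetical/proposed functions discussed here,
and htole32() merely stands in for whatever endianness conversion the
caller happens to need):

#include <stdint.h>
#include <stddef.h>
#include <endian.h>     /* htole32(), a glibc extension */

/* Hypothetical prototypes; the return type of memfill() is a guess. */
void memfill(void *buf, size_t bufsize, const void *pat, size_t patsize);
void *memset32(uint32_t *s, uint32_t v, size_t count);

void fill_with_memfill(uint32_t *buf, size_t n, uint32_t base)
{
        /* The value is constructed in place, but must be given a name
         * so its address can be passed to memfill(). */
        uint32_t v = htole32(base | 1);
        memfill(buf, n * sizeof(uint32_t), &v, sizeof(v));
}

void fill_with_memset32(uint32_t *buf, size_t n, uint32_t base)
{
        /* The same value can be passed directly as an expression. */
        memset32(buf, htole32(base | 1), n);
}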

Replacing the 'count' of elements with the number of bytes leads to
confusion and unnecessary questions on the implementation side about what
to do if 'n' is not a multiple of 'sizeof v'.  Likewise, specifying a
'void *' destination instead of a properly typed destination leads to
confusion around endianness and correct behaviour of pointers which are
not aligned to 'sizeof v'.

I have considered making the return type of these functions 'void'
as the return value is rarely useful.  But I thought it better to keep
compatibility with memset().  I also considered making them return the
same type that they were called with, but I don't think that's useful
either.

Adding memset8() or memset_c() as aliases of memset() would not be
terribly useful in my opinion.  At this time, I am not proposing adding
memset_f(), memset_d() or memset_ld(), though I can certainly see the
utility in floating-point heavy programs.  Variants for some other types
(size_t, ptrdiff_t, wchar_t, the _Complex or _Bool types) could also
be added.


Re: Variations of memset()

Carlos O'Donell-6
On 08/04/2017 01:11 PM, Matthew Wilcox wrote:
> void *memset16(uint16_t *s, uint16_t v, size_t count);
> void *memset32(uint32_t *s, uint32_t v, size_t count);
> void *memset64(uint64_t *s, uint64_t v, size_t count);
> void *memset_s(unsigned short *p, unsigned short v, size_t count);
> void *memset_i(unsigned int *p, unsigned int v, size_t count);
> void *memset_l(unsigned long *p, unsigned long v, size_t count);
> void *memset_ll(unsigned long long *p, unsigned long long v, size_t count);
> void *memset_p(void **p, void *v, size_t count);

How are users expected to use these functions?

What do users currently use them for?

--
Cheers,
Carlos.

Re: Variations of memset()

Andreas Schwab-2
In reply to this post by Matthew Wilcox-2
On Aug 04 2017, Matthew Wilcox <[hidden email]> wrote:

> void *memset_s(unsigned short *p, unsigned short v, size_t count);

This clashes with Annex K, which already defines memset_s.

Andreas.

--
Andreas Schwab, [hidden email]
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Re: Variations of memset()

Matthew Wilcox-2
In reply to this post by Carlos O'Donell-6
On Fri, Aug 04, 2017 at 02:09:11PM -0400, Carlos O'Donell wrote:

> On 08/04/2017 01:11 PM, Matthew Wilcox wrote:
> > void *memset16(uint16_t *s, uint16_t v, size_t count);
> > void *memset32(uint32_t *s, uint32_t v, size_t count);
> > void *memset64(uint64_t *s, uint64_t v, size_t count);
> > void *memset_s(unsigned short *p, unsigned short v, size_t count);
> > void *memset_i(unsigned int *p, unsigned int v, size_t count);
> > void *memset_l(unsigned long *p, unsigned long v, size_t count);
> > void *memset_ll(unsigned long long *p, unsigned long long v, size_t count);
> > void *memset_p(void **p, void *v, size_t count);
>
> How are users expected to use these functions?
>
> What do users currently use them for?

I've identified three places in the Linux kernel which are quite
dissimilar that can make use of such functions.

1. Initialising an array of uint32_t to a particular value.  This is part
of the Symbios SCSI driver.  It is initialisation code.  This benefits
from smaller code size.

2. Decompression in the zram driver.  The zram driver can compress a page
of RAM which consists entirely of a repeated pattern (up to the size of a
'long').  This benefits from the speed of memset_l.

3. Various console tricks (scrolling, etc).  The VGA console consists
of two-byte entries in an array, and inserting a blank line is done with
a call to memset16().
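
As a sketch of the console case (the constants and buffer layout below
are illustrative only, not the kernel's actual code):

#include <stdint.h>
#include <stddef.h>

#define VGA_COLS 80

/* Each VGA text cell is a 16-bit attribute/character pair; 0x0720 is
 * a space with a light-grey-on-black attribute.  memset16() is the
 * proposed function. */
static void blank_row(uint16_t *vga_buf, size_t row)
{
        memset16(vga_buf + row * VGA_COLS, 0x0720, VGA_COLS);
}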

Here's the sample usage from the symbios driver:

-               for (i = 0 ; i < 64 ; i++)
-                       tp->luntbl[i] = cpu_to_scr(vtobus(&np->badlun_sa));
+               memset32(tp->luntbl, cpu_to_scr(vtobus(&np->badlun_sa)), 64);

I expect a lot of users would be of this type; simply replacing the
explicit for-loop equivalent with a library call.

Re: Variations of memset()

Carlos O'Donell-6
On 08/04/2017 03:02 PM, Matthew Wilcox wrote:
> Here's the sample usage from the symbios driver:
>
> -               for (i = 0 ; i < 64 ; i++)
> -                       tp->luntbl[i] = cpu_to_scr(vtobus(&np->badlun_sa));
> +               memset32(tp->luntbl, cpu_to_scr(vtobus(&np->badlun_sa)), 64);
>
> I expect a lot of users would be of this type; simply replacing the
> explicit for-loop equivalent with a library call.
 
Have you measured the performance of this kind of conversion when using a
simple application and a library implementing your various memset routines?
In the kernel it is one thing; outside of the kernel we have dynamic linking
and no inlining across that shared object boundary.

Should the compiler be doing better?

--
Cheers,
Carlos.

Re: Variations of memset()

Jeff Law
In reply to this post by Matthew Wilcox-2
On 08/04/2017 01:02 PM, Matthew Wilcox wrote:

> On Fri, Aug 04, 2017 at 02:09:11PM -0400, Carlos O'Donell wrote:
>> On 08/04/2017 01:11 PM, Matthew Wilcox wrote:
>>> void *memset16(uint16_t *s, uint16_t v, size_t count);
>>> void *memset32(uint32_t *s, uint32_t v, size_t count);
>>> void *memset64(uint64_t *s, uint64_t v, size_t count);
>>> void *memset_s(unsigned short *p, unsigned short v, size_t count);
>>> void *memset_i(unsigned int *p, unsigned int v, size_t count);
>>> void *memset_l(unsigned long *p, unsigned long v, size_t count);
>>> void *memset_ll(unsigned long long *p, unsigned long long v, size_t count);
>>> void *memset_p(void **p, void *v, size_t count);
>>
>> How are users expected to use these functions?
>>
>> What do users currently use them for?
>
> I've identified three places in the Linux kernel which are quite
> dissimilar that can make use of such functions.
>
> 1. Initialising an array of uint32_t to a particular value.  This is part
> of the Symbios SCSI driver.  It is initialisation code.  This benefits
> from smaller code size
>
> 2. Decompression in the zram driver.  The zram driver can compress a page
> of RAM which consists entirely of a repeated pattern (up to the size of a
> 'long').  This benefits from the speed of memset_l.
So for these cases, would it make more sense for the compiler to identify
the initialization loop and have it call into the new routines in glibc,
rather than changing the kernel or various applications to call those
routines?

GCC already has the ability to identify loops that boil down to a
memset.  Extending that support to handle initializations using objects
larger than 8 bits would seem straightforward to me.
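
For example (a sketch; the symbol the compiler would emit for the wide
case is whatever name the library ends up providing):

#include <stdint.h>
#include <stddef.h>

/* The byte-sized loop below is already recognised and turned into a
 * memset() call; the suggestion is to do the same for wider element
 * types, emitting a call to e.g. memset32() once such a symbol exists. */
void fill_bytes(uint8_t *p, size_t n)
{
        for (size_t i = 0; i < n; i++)
                p[i] = 0xff;
}

void fill_words(uint32_t *p, size_t n)
{
        for (size_t i = 0; i < n; i++)
                p[i] = 0xdeadbeef;
}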

Jeff

Re: Variations of memset()

Alexander Monakov-2
On Fri, 4 Aug 2017, Jeff Law wrote:

> > I've identified three places in the Linux kernel which are quite
> > dissimilar that can make use of such functions.
> >
> > 1. Initialising an array of uint32_t to a particular value.  This is part
> > of the Symbios SCSI driver.  It is initialisation code.  This benefits
> > from smaller code size
> >
> > 2. Decompression in the zram driver.  The zram driver can compress a page
> > of RAM which consists entirely of a repeated pattern (up to the size of a
> > 'long').  This benefits from the speed of memset_l.
> So for these cases, would it make more sense for the compiler to identify
> the initialization loop and have it call into the new routines in glibc,
> rather than changing the kernel or various applications to call those
> routines?

The Linux kernel does not (and cannot) link against Glibc at all...

Alexander

Re: Variations of memset()

Carlos O'Donell-6
On 08/04/2017 03:18 PM, Alexander Monakov wrote:

> On Fri, 4 Aug 2017, Jeff Law wrote:
>>> I've identified three places in the Linux kernel which are quite
>>> dissimilar that can make use of such functions.
>>>
>>> 1. Initialising an array of uint32_t to a particular value.  This is part
>>> of the Symbios SCSI driver.  It is initialisation code.  This benefits
>>> from smaller code size
>>>
>>> 2. Decompression in the zram driver.  The zram driver can compress a page
>>> of RAM which consists entirely of a repeated pattern (up to the size of a
>>> 'long').  This benefits from the speed of memset_l.
>> So for these cases, would it make more sense for the compiler to identify
>> the initialization loop and have it call into the new routines in glibc,
>> rather than changing the kernel or various applications to call those
>> routines?
>
> The Linux kernel does not (and cannot) link against Glibc at all...

What matters is that the compiler turns the initialization into a call to an
implementation dependent symbol.

Where the implementation comes from can vary. In userspace it can come from
glibc, in the kernel it can come from some other object file.

--
Cheers,
Carlos.

Re: Variations of memset()

Carlos O'Donell-6
In reply to this post by Carlos O'Donell-6
On 08/04/2017 03:11 PM, Carlos O'Donell wrote:

> On 08/04/2017 03:02 PM, Matthew Wilcox wrote:
>> Here's the sample usage from the symbios driver:
>>
>> -               for (i = 0 ; i < 64 ; i++)
>> -                       tp->luntbl[i] = cpu_to_scr(vtobus(&np->badlun_sa));
>> +               memset32(tp->luntbl, cpu_to_scr(vtobus(&np->badlun_sa)), 64);
>>
>> I expect a lot of users would be of this type; simply replacing the
>> explicit for-loop equivalent with a library call.
>  
> Have you measured the performance of this kind of conversion when using a
> simple application and a library implementing your various memset routines?
> In the kernel it is one thing; outside of the kernel we have dynamic linking
> and no inlining across that shared object boundary.

I want to reiterate that measuring the performance of the various options in
userspace is going to be relevant (particularly where the results differ from
the kernel case):

* Application doing the naive loop above (-O0).

* Application doing the naive loop above ([-O2,-O3] + <vectorize options>).

* Application calling memset32 (-O0)

* Application calling memset32 (-O3)

<vectorize options>="-ftree-vectorize [-msse2,-mavx] -fopt-info-missed=missed.all"

You need to split the memset32 into another DSO to simulate this accurately.
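
A minimal harness for that comparison might look like this (a sketch
only; memset32 is assumed to come from a separately built shared
object, e.g. a hypothetical libmemset32.so, precisely so it cannot be
inlined):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Provided by the separate DSO. */
extern void *memset32(uint32_t *s, uint32_t v, size_t count);

#define N    (64 * 1024)
#define REPS 10000

uint32_t buf[N];        /* external linkage, so the stores are not elided */

static long elapsed_ns(const struct timespec *t0, const struct timespec *t1)
{
        return (t1->tv_sec - t0->tv_sec) * 1000000000L
                + (t1->tv_nsec - t0->tv_nsec);
}

int main(void)
{
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++)
                for (size_t i = 0; i < N; i++)          /* naive loop */
                        buf[i] = 0xdeadbeef;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("loop:     %ld ns\n", elapsed_ns(&t0, &t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++)
                memset32(buf, 0xdeadbeef, N);           /* out-of-line call */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("memset32: %ld ns\n", elapsed_ns(&t0, &t1));

        return 0;
}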

--
Cheers,
Carlos.

Re: Variations of memset()

H.J. Lu-30
On Fri, Aug 4, 2017 at 12:54 PM, Carlos O'Donell <[hidden email]> wrote:

> On 08/04/2017 03:11 PM, Carlos O'Donell wrote:
>> On 08/04/2017 03:02 PM, Matthew Wilcox wrote:
>>> Here's the sample usage from the symbios driver:
>>>
>>> -               for (i = 0 ; i < 64 ; i++)
>>> -                       tp->luntbl[i] = cpu_to_scr(vtobus(&np->badlun_sa));
>>> +               memset32(tp->luntbl, cpu_to_scr(vtobus(&np->badlun_sa)), 64);
>>>
>>> I expect a lot of users would be of this type; simply replacing the
>>> explicit for-loop equivalent with a library call.
>>
>> Have you measured the performance of this kind of conversion when using a
>> simple application and a library implementing your various memset routines?
>> In the kernel it is one thing; outside of the kernel we have dynamic linking
>> and no inlining across that shared object boundary.
>
> I want to reiterate that measuring the performance of the various options in
> userspace is going to be relevant (particularly where the results differ from
> the kernel case):
>
> * Application doing the naive loop above (-O0).
>
> * Application doing the naive loop above ([-O2,-O3] + <vectorize options>).
>
> * Application calling memset32 (-O0)
>
> * Application calling memset32 (-O3)
>
> <vectorize options>="-ftree-vectorize [-msse2,-mavx] -fopt-info-missed=missed.all"
>
> You need to split the memset32 into another DSO to simulate this accurately.
>

These functions aren't very useful for x86-64, where wmemset (in effect,
memset32) is already implemented in terms of memset:

Dump of assembler code for function __wmemset_sse2_unaligned:
   0x0000000000000020 <+0>: shl    $0x2,%rdx
   0x0000000000000024 <+4>: movd   %esi,%xmm0
   0x0000000000000028 <+8>: mov    %rdi,%rax
   0x000000000000002b <+11>: pshufd $0x0,%xmm0,%xmm0
   0x0000000000000030 <+16>: jmp    0x64 <__memset_sse2_unaligned+20>
End of assembler dump.
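
Put differently, on a target where wchar_t is a 32-bit type (as it is
for x86-64 glibc), a memset32 could be little more than a wrapper.  A
sketch (the wrapper name is illustrative, and the aliasing and
sign/zero-extension caveats for signed wchar_t apply):

#include <wchar.h>
#include <stdint.h>
#include <stddef.h>

/* Only valid where wchar_t is exactly 32 bits wide. */
static void *memset32_via_wmemset(uint32_t *s, uint32_t v, size_t count)
{
        _Static_assert(sizeof(wchar_t) == sizeof(uint32_t),
                       "wchar_t must be 32 bits for this shortcut");
        return wmemset((wchar_t *)s, (wchar_t)v, count);
}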


--
H.J.

Re: Variations of memset()

Joseph Myers
In reply to this post by Matthew Wilcox-2
On Fri, 4 Aug 2017, Matthew Wilcox wrote:

> void *memset16(uint16_t *s, uint16_t v, size_t count);
> void *memset32(uint32_t *s, uint32_t v, size_t count);
> void *memset64(uint64_t *s, uint64_t v, size_t count);

Is your assertion that there are useful optimizations for these that
result in code too large for it to make sense for the compiler to inline
it, so an out-of-line function call is appropriate instead?  I'd expect
the compiler to vectorize memset-like loops where appropriate, inline,
without needing such functions.  Or is the desire to use IFUNCs for such
functions so code can be optimized for the processor it is running on,
without needing to be compiled for each processor with different vector
instructions that should be used?

(As noted, wmemset already provides memset32 semantics for all glibc
configurations, subject to ABI questions about sign/zero extension of the
second argument on ABIs where wchar_t is signed and that affects how the
argument is passed compared to uint32_t.)

--
Joseph S. Myers
[hidden email]

Re: Variations of memset()

Florian Weimer-5
On 08/04/2017 10:57 PM, Joseph Myers wrote:

> On Fri, 4 Aug 2017, Matthew Wilcox wrote:
>
>> void *memset16(uint16_t *s, uint16_t v, size_t count);
>> void *memset32(uint32_t *s, uint32_t v, size_t count);
>> void *memset64(uint64_t *s, uint64_t v, size_t count);
>
> Is your assertion that there are useful optimizations for these that
> result in code too large for it to make sense for the compiler to inline
> it, so an out-of-line function call is appropriate instead?  I'd expect
> the compiler to vectorize memset-like loops where appropriate, inline,
> without needing such functions.

Small-scale vectorization doesn't work well in practice because most
current Intel CPUs have very high startup costs for using vector
instructions.  To some degree, this applies to AVX2 as well, but it is
allegedly quite visible with AVX512, leading to a reduction in overall
system throughput:

  <https://sourceware.org/ml/libc-alpha/2017-04/msg00325.html>

> Or is the desire to use IFUNCs for such
> functions so code can be optimized for the processor it is running on,
> without needing to be compiled for each processor with different vector
> instructions that should be used?

It's also system-specific.  Whether the more recent vector instructions
help depends a lot on the overall workload on the system.  My current
impression is that auto-vectorization beyond SSE2 is extremely tricky
and probably not worth it for generic/pre-compiled software.

On the other hand, if you deploy on other people's machines, you might
not care that you negatively impact overall system performance, as long
as you squeeze out a few more percent for yourself.  Based on the media
reports I have read (and the patch I referenced above), AVX512 in
particular gives rise to a prisoners' dilemma in this context. 8-(

Thanks,
Florian