RFC: Creating a more efficient sincos interface

RFC: Creating a more efficient sincos interface

Wilco Dijkstra-2
Hi,

The existing sincos functions use 2 pointers to return the sine and cosine result. In
most cases 4 memory accesses are necessary per call. This is inefficient and often
significantly slower than returning values in registers. I ran a few experiments on the
new optimized sincosf implementation in GLIBC using the following interface:

__complex__ float sincosf2 (float);

This has 50% higher throughput and a 25% reduction in latency on Cortex-A72 for
random inputs in the range +-PI/4. Larger inputs take longer and thus have lower
gains, but there is still a 5% gain on the (rarely used) path with full range reduction.
Given sincos is used in various HPC applications this can give a worthwhile speedup.

LLVM already supports something similar for OSX using a struct of 2 floats.
Using complex float is better since not all targets may support returning structures in
floating point registers and GCC generates very inefficient code on targets that do
(PR86145).

What do people think? Ideally I'd like to support this in a generic way so all targets can
benefit, but it's also feasible to enable it on a per-target basis. Also since not all libraries
will support the new interface, there would have to be a flag or configure option to switch
the new interface off if not supported (maybe automatically based on the math.h header).

Wilco

Re: RFC: Creating a more efficient sincos interface

H.J. Lu-30
On Thu, Sep 13, 2018 at 6:27 AM, Wilco Dijkstra <[hidden email]> wrote:
> Hi,
>
> The existing sincos functions use 2 pointers to return the sine and cosine result. In
> most cases 4 memory accesses are necessary per call. This is inefficient and often
> significantly slower than returning values in registers. I ran a few experiments on the
> new optimized sincosf implementation in GLIBC using the following interface:
>
> __complex__ float sincosf2 (float);

Is this an internal interface or public one?

> This has 50% higher throughput and a 25% reduction in latency on Cortex-A72 for
> random inputs in the range +-PI/4. Larger inputs take longer and thus have lower
> gains, but there is still a 5% gain on the (rarely used) path with full range reduction.
> Given sincos is used in various HPC applications this can give a worthwhile speedup.
>
> LLVM already supports something similar for OSX using a struct of 2 floats.
> Using complex float is better since not all targets may support returning structures in
> floating point registers and GCC generates very inefficient code on targets that do
> (PR86145).
>
> What do people think? Ideally I'd like to support this in a generic way so all targets can
> benefit, but it's also feasible to enable it on a per-target basis. Also since not all libraries
> will support the new interface, there would have to be a flag or configure option to switch
> the new interface off if not supported (maybe automatically based on the math.h header).
>
> Wilco



--
H.J.

Re: RFC: Creating a more efficient sincos interface

Florian Weimer-5
In reply to this post by Wilco Dijkstra-2
On 09/13/2018 03:27 PM, Wilco Dijkstra wrote:

> Hi,
>
> The existing sincos functions use 2 pointers to return the sine and cosine result. In
> most cases 4 memory accesses are necessary per call. This is inefficient and often
> significantly slower than returning values in registers. I ran a few experiments on the
> new optimized sincosf implementation in GLIBC using the following interface:
>
> __complex__ float sincosf2 (float);
>
> This has 50% higher throughput and a 25% reduction in latency on Cortex-A72 for
> random inputs in the range +-PI/4. Larger inputs take longer and thus have lower
> gains, but there is still a 5% gain on the (rarely used) path with full range reduction.
> Given sincos is used in various HPC applications this can give a worthwhile speedup.

I think this is totally fine if you call it expif or something like that
(and put the sine in the imaginary part, of course).

In general, I would object to using complex numbers for arbitrary pairs,
but this doesn't apply to this case.

Thanks,
Florian

Re: RFC: Creating a more efficient sincos interface

Szabolcs Nagy-2
On 13/09/18 14:52, Florian Weimer wrote:

> On 09/13/2018 03:27 PM, Wilco Dijkstra wrote:
>> Hi,
>>
>> The existing sincos functions use 2 pointers to return the sine and cosine result. In
>> most cases 4 memory accesses are necessary per call. This is inefficient and often
>> significantly slower than returning values in registers. I ran a few experiments on the
>> new optimized sincosf implementation in GLIBC using the following interface:
>>
>> __complex__ float sincosf2 (float);
>>
>> This has 50% higher throughput and a 25% reduction in latency on Cortex-A72 for
>> random inputs in the range +-PI/4. Larger inputs take longer and thus have lower
>> gains, but there is still a 5% gain on the (rarely used) path with full range reduction.
>> Given sincos is used in various HPC applications this can give a worthwhile speedup.
>
> I think this is totally fine if you call it expif or something like that (and put the sine in the imaginary part, of course).
>

gcc seems to have a __builtin_cexpif
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/builtins.c;h=58ea7475ef7bb2a8abad2463b896efaa8fd79650;hb=HEAD#l2439

but I don't see it documented. Maybe we
can add an actual cexpif symbol with the
above signature?

> In general, I would object to using complex numbers for arbitrary pairs, but this doesn't apply to this case.
>
> Thanks,
> Florian


Re: RFC: Creating a more efficient sincos interface

Alexander Monakov-2
In reply to this post by Wilco Dijkstra-2
On Thu, 13 Sep 2018, Wilco Dijkstra wrote:
> What do people think? Ideally I'd like to support this in a generic way so all targets can
> benefit, but it's also feasible to enable it on a per-target basis. Also since not all libraries
> will support the new interface, there would have to be a flag or configure option to switch
> the new interface off if not supported (maybe automatically based on the math.h header).

GCC already has __builtin_cexpi for this, so I think you can introduce cexpi
implementation in libc, and then adjust expand_builtin_cexpi appropriately.

I wonder if it would be possible to add a fallback cexpi implementation to
libgcc.a that would be picked by the linker if there's no such symbol in libm?

Alexander

Re: RFC: Creating a more efficient sincos interface

Joseph Myers
In reply to this post by Florian Weimer-5
On Thu, 13 Sep 2018, Florian Weimer wrote:

> I think this is totally fine if you call it expif or something like that (and
> put the sine in the imaginary part, of course).

And declare it in bits/cmathcalls.h as included from complex.h, rather
than in math.h.  With an appropriate custom RUN_TEST_LOOP_* macro that
deals with the different order of expected results you should be able to
put the tests in libm-test-sincos.inc, sharing the array of expected
results with that for sincos rather than needing to generate a separate
file of expected results with sin and cos swapped.  Presumably you'd want
the various type-generic templates for complex functions using M_SINCOS to
move to using (an implementation-namespace name for) the faster interface,
but that could be a separate patch.

--
Joseph S. Myers
[hidden email]

Re: RFC: Creating a more efficient sincos interface

Richard Biener
In reply to this post by Alexander Monakov-2
On September 13, 2018 4:32:42 PM GMT+02:00, Alexander Monakov <[hidden email]> wrote:

>On Thu, 13 Sep 2018, Wilco Dijkstra wrote:
>> What do people think? Ideally I'd like to support this in a generic way so all targets can
>> benefit, but it's also feasible to enable it on a per-target basis. Also since not all libraries
>> will support the new interface, there would have to be a flag or configure option to switch
>> the new interface off if not supported (maybe automatically based on the math.h header).
>
>GCC already has __builtin_cexpi for this, so I think you can introduce cexpi
>implementation in libc, and then adjust expand_builtin_cexpi appropriately.

Note that we currently expand that to sincos (if available) or cexp. We use it for canonicalization and better optimization on GIMPLE (register-promoting the pointed-to variables).

>I wonder if it would be possible to add a fallback cexpi implementation to
>libgcc.a that would be picked by the linker if there's no such symbol in libm?

That would probably be a requirement.

Richard.

>
>Alexander