New x86-64 micro-architecture levels


Sourceware - libc-alpha mailing list
Most Linux distributions still compile against the original x86-64
baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel
EM64T compatibility).

There has been an attempt to use the existing AT_PLATFORM-based loading
mechanism in the glibc dynamic linker to enable a selection of optimized
libraries.  But the general selection mechanism in glibc is problematic:

  hwcaps subdirectory selection in the dynamic loader
  <https://sourceware.org/pipermail/libc-alpha/2020-May/113757.html>

We also have the problem that the glibc version of "haswell" is distinct
from GCC's -march=haswell (and presumably other compilers):

  Definition of "haswell" platform is inconsistent with GCC
  <https://sourceware.org/bugzilla/show_bug.cgi?id=24080>

And that the selection criteria are not what people expect:

  Epyc and other current AMD CPUs do not select the "haswell" platform
  subdirectory
  <https://sourceware.org/bugzilla/show_bug.cgi?id=23249>

Since the hwcaps-based selection does not work well regardless of
architecture (even in cases where the kernel provides glibc with data), I
worked on a new mechanism that does not have the problems associated
with the old mechanism:

  [PATCH 00/30] RFC: elf: glibc-hwcaps support
  <https://sourceware.org/pipermail/libc-alpha/2020-June/115250.html>

(Don't be concerned that these patches have not been reviewed; we are
busy preparing the glibc 2.32 release, and these changes do not alter
the glibc ABI itself, so they do not have immediate priority.  I'm
fairly confident that a version of these changes will make it into glibc
2.33, and I hope to backport them into Fedora 33, Fedora 32, and Red Hat
Enterprise Linux 8.4.  Debian as well, but I have never done anything
like it there, so I don't know if the patches will be accepted.)

Out of the box, this should work fairly well for IBM POWER and Z, where
there is a clear progression of silicon versions (at least on paper;
virtualization may blur the picture somewhat).

However, for x86, we do not have such a clear progression of
micro-architecture versions.  This is not just a result of the
AMD/Intel competition, but also due to ongoing product differentiation
within one chip vendor.  I think we need these levels broadly for the
following reasons:

* Selecting on individual CPU features (similar to the old hwcaps
  mechanism) in glibc has scalability issues, particularly for
  LD_LIBRARY_PATH processing.

* Developers need guidance about useful targets for optimization.  I
  think there is value in limiting the choices, in the sense that “if
  you are able to test three builds in total, these are the things you
  should build”.

* glibc and the compilers should align in their definition of the
  levels, so that developers can use an -march= option to build for a
  particular level that is recognized by glibc.  This is why I think the
  description of the levels should go into the psABI supplement.

* A preference order for these levels avoids falling back to the K8
  baseline if the platform progresses to a new version due to
  glibc/kernel/hypervisor/hardware upgrades.

I'm including a proposal for the levels below.  I use single letters for
them, but I expect that the concrete implementation of this proposal
will use names like “x86-100”, “x86-101”, like in the glibc patch
referenced above.  (But we can discuss other approaches.)

I looked at various machines in the Red Hat labs and talked to Intel and
AMD engineers about this, but this concrete proposal is based on my own
analysis of the situation.  I excluded CPU features related to
cryptography and cache management, including hardware transactional
memory, and CPU timing.  I assume that we will see some of these
features being disabled by the firmware or the kernel over time.  That
would eliminate entire levels from selection, which is not desirable.
For cryptographic code, I expect that localized selection of an
optimized implementation works because such code tends to be isolated
blocks, running for dozens of cycles each time, not something that gets
scattered all over the place by the compiler.

We previously discussed not emitting VZEROUPPER at later levels, but I
don't think this is beneficial because the ABI does not have
callee-saved vector registers, so it can only be useful with local
functions (or whatever LTO considers local), where there is no ABI
impact anyway.

I did not include FSGSBASE because the FS base is already available at
%fs:0.  Changing the FS base in userspace breaks too much, so the main
benefit is the tighter encoding of rdfsbase, which seems very slim.

Not covered in this are tuning decisions.  I think we can benefit from
some variance in this area between implementations; it should not affect
correctness.  32-bit support is also a separate matter.

* Level A

CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3

This is one step above the K8 baseline and corresponds to a mainline CPU
model ca. 2008 to 2011.  It is also implemented by recent-ish
generations of Intel Atom server CPUs (although I haven't tested the
latest version).  A 32-bit variant would have to list many additional
CPU features here.

* Level B

AVX, plus everything in level A.

This step is so small that it probably can be dropped, unless the
benefits from using VEX encoding are truly significant.

For AVX and some of the following features, it is assumed that the
run-time selection takes full support coverage (from silicon to the
kernel) into account.

* Level C

AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B.

This is close to what glibc currently calls "haswell".

* Level D

AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in
level C.

This is the AVX-512 level implemented by Xeon Scalable Processors, not
the Xeon Phi variant.
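
As a rough illustration of how the cumulative levels fit together, the
following sketch encodes the four proposed feature lists and reports the
highest level that a given set of CPU features satisfies.  The feature
strings are simply the names used in this proposal; the LEVELS table and
the highest_level function are hypothetical names for this example, not
part of glibc or GCC, and obtaining the feature set from real hardware
(including the kernel-support checks mentioned under level B) is out of
scope here.

```python
# Features each level adds on top of the previous one, as listed in the
# proposal.  The names are the proposal's own, not kernel or CPUID flag names.
LEVELS = [
    ("A", {"CMPXCHG16B", "LAHF/SAHF", "POPCNT", "SSE3", "SSE4.1", "SSE4.2",
           "SSSE3"}),
    ("B", {"AVX"}),
    ("C", {"AVX2", "BMI1", "BMI2", "F16C", "FMA", "LZCNT", "MOVBE"}),
    ("D", {"AVX512F", "AVX512BW", "AVX512CD", "AVX512DQ", "AVX512VL"}),
]


def highest_level(cpu_features):
    """Return the highest level fully supported, or None for the baseline.

    Levels are cumulative: a level only counts if every lower level is
    satisfied as well.
    """
    available = set(cpu_features)
    required = set()
    best = None
    for name, added in LEVELS:
        required |= added
        if not required <= available:
            break
        best = name
    return best
```

Because the check stops at the first unsatisfied level, a CPU that reports
AVX2 but not AVX would still be classified as level A in this sketch.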


glibc (or an alternative loader implementation) would search for
libraries starting at level D, going back to level A, and finally the
baseline implementation in the default library location.
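
The search order described above can be sketched as follows.  This is
only an illustration under assumptions: the "glibc-hwcaps/level-X"
directory names and the search_directories function are made up for the
example, since the final subdirectory names are still open in this
proposal.

```python
import posixpath

# Levels from lowest to highest, as in the proposal.
LEVEL_ORDER = ["A", "B", "C", "D"]


def search_directories(libdir, highest_supported):
    """List directories to try, most specific first.

    highest_supported is a level letter, or None if the CPU only meets
    the baseline.  The subdirectory naming is hypothetical.
    """
    dirs = []
    if highest_supported is not None:
        top = LEVEL_ORDER.index(highest_supported)
        # Walk from the highest supported level back down to level A.
        for name in reversed(LEVEL_ORDER[:top + 1]):
            dirs.append(posixpath.join(libdir, "glibc-hwcaps", "level-" + name))
    # The default library location is always the final fallback.
    dirs.append(libdir)
    return dirs
```

For a level C machine, the sketch yields the level C, B and A
subdirectories followed by the default location; a level D subdirectory
is never consulted there.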

I expect that some distributions will also use these levels to set a
baseline for the entire distribution (i.e., everything would be built to
level A or maybe even level C), and these libraries would then be
installed in the default location.

I'll be glad if I can get any feedback on this proposal.  I plan to turn
it into a merge request for the x86-64 psABI document eventually.

Thanks,
Florian


Re: New x86-64 micro-architecture levels

Joseph Myers
On Fri, 10 Jul 2020, Florian Weimer via Gcc wrote:

> * Level A
>
> CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
>
> This is one step above the K8 baseline and corresponds to a mainline CPU
> model ca. 2008 to 2011.  It is also implemented by recent-ish
> generations of Intel Atom server CPUs (although I haven't tested the
> latest version).  A 32-bit variant would have to list many additional
> CPU features here.

FWIW, this is also the limit of what can be run under QEMU emulation, as
QEMU lacks support for AVX and newer instruction set features.

On the other hand, virtual machines seem liable to report something closer
to the K8 baseline to the guest OS, missing the level A features, even
when the underlying hardware supports everything in level B or level C.

--
Joseph S. Myers
[hidden email]

Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Sourceware - libc-alpha mailing list
On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <[hidden email]> wrote:

> [...]

Looks good.  I like it.   My only concerns are

1. Names like “x86-100”, “x86-101”, what features do they support?
2. I have a library with AVX2 and FMA, which directory should it go into?

Can we pass such info to ld.so and have ld.so print out the best directory
name?

--
H.J.

Re: New x86-64 micro-architecture levels

Allan Sandfeld Jensen
In reply to this post by Sourceware - libc-alpha mailing list
On Friday, 10 July 2020 19:30:09 CEST Florian Weimer via Gcc wrote:

> glibc (or an alternative loader implementation) would search for
> libraries starting at level D, going back to level A, and finally the
> baseline implementation in the default library location.
>
> I expect that some distributions will also use these levels to set a
> baseline for the entire distribution (i.e., everything would be built to
> level A or maybe even level C), and these libraries would then be
> installed in the default location.
>
> I'll be glad if I can get any feedback on this proposal.  I plan to turn
> it into a merge request for the x86-64 psABI document eventually.
>
Sounds good, though if I could dream I would also love a partial replacement
option. So that you could have a generic x86-64 binary that only had some AVX2
optimized replacement functions in a supplementary library.

Perhaps implemented by marking the library as a partial replacement, so the
dynamic linker would also load the base or lower libraries except for
functions already resolved.

You could also add a level E for the AVX512 instructions in ice lake and
above. The VBMI1/2 instructions would likely be useful for autovectorization
in GCC.

'Allan



Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Sourceware - libc-alpha mailing list
On Fri, Jul 10, 2020 at 11:45 PM H.J. Lu via Gcc <[hidden email]> wrote:

>
> On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <[hidden email]> wrote:
> > [...]
>
> Looks good.  I like it.

Likewise.  Btw, did you check that VIA family chips slot into Level A
at least?  Where do AMD bdverN slot in?

>  My only concerns are
>
> 1. Names like “x86-100”, “x86-101”, what features do they support?

Indeed I didn't get the -100, -101 part.  On the GCC side I'd have
suggested -march=generic-{A,B,C,D} implying the respective
-mtune.

Do the patches end up annotating ELF binaries with the architecture
level and does ld.so check that info?

For example IIRC there's a penalty to switch between VEX and
not VEX encoded instructions so even on AVX capable hardware
it might be profitable to use non-AVX libraries if the program is
using only architecture level A?

On that side, does architecture level B+ suggest using VEX encoding
everywhere?  It would be indeed nice to have the architecture levels
documented in the psABI.

> 2. I have a library with AVX2 and FMA, which directory should it go into?

Eventually GCC/gas can annotate objects with the lowest architecture
level that is applicable?

Thanks for doing this,
Richard.

> Can we pass such info to ld.so and have ld.so print out the best directory
> name?
>
> --
> H.J.

Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Sourceware - libc-alpha mailing list
* H. J. Lu:

> Looks good.  I like it.

Thanks.  What do you think about Level B?  Should we keep it?

> My only concerns are
>
> 1. Names like “x86-100”, “x86-101”, what features do they support?

I think we can add more diagnostic output to ld.so --help.  My patch
does not show individual CPU flags, but I agree this could be useful.
(It's not needed for the legacy HWCAP subdirectories because in general,
those are named & defined by the kernel, not by individually named CPU
feature flags.)

> 2. I have a library with AVX2 and FMA, which directory should it go into?
>
> Can we pass such info to ld.so and have ld.so print out the best directory
> name?

I think this would require generating matching GNU property notes (list
the CPU features required by the binary).  Once we have that, we can add
something to binutils or indeed ld.so to analyze them and print the
recommended directory.  But I think this is something that could come
later.

We can also write a GCC header which looks at macros such as __AVX2__
and prints a #warning with the recommended directory name.  Checking for
excess flags will be tricky in this context, though, and if we miss
something, a wrong recommendation will be the result.
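
To make the recommendation step concrete, here is a small sketch of what
such a tool might compute once the required features of a binary are
known.  It assumes the feature list has already been extracted (from
property notes or compiler macros, neither of which is modelled here),
and the LEVELS table and recommend_level function are hypothetical names
for this example.  Features that belong to no level are reported as an
error instead of being ignored, which is one way to avoid a wrong
recommendation caused by excess flags.

```python
# Features each proposed level adds on top of the previous one.
LEVELS = [
    ("A", {"CMPXCHG16B", "LAHF/SAHF", "POPCNT", "SSE3", "SSE4.1", "SSE4.2",
           "SSSE3"}),
    ("B", {"AVX"}),
    ("C", {"AVX2", "BMI1", "BMI2", "F16C", "FMA", "LZCNT", "MOVBE"}),
    ("D", {"AVX512F", "AVX512BW", "AVX512CD", "AVX512DQ", "AVX512VL"}),
]


def recommend_level(required_features):
    """Return the lowest level covering all required features.

    Returns "baseline" if nothing beyond the baseline is required and
    raises ValueError if a feature is not part of any level.
    """
    required = set(required_features)
    if not required:
        return "baseline"
    covered = set()
    for name, added in LEVELS:
        covered |= added
        if required <= covered:
            return name
    # After the loop, covered holds every feature of every level.
    excess = sorted(required - covered)
    raise ValueError("features outside all levels: " + ", ".join(excess))
```

With this table, a library that needs AVX2 and FMA would be recommended
for the level C directory.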

Thanks,
Florian


Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Allan Sandfeld Jensen
* Allan Sandfeld Jensen:

> On Friday, 10 July 2020 19:30:09 CEST Florian Weimer via Gcc wrote:
>> glibc (or an alternative loader implementation) would search for
>> libraries starting at level D, going back to level A, and finally the
>> baseline implementation in the default library location.
>>
>> I expect that some distributions will also use these levels to set a
>> baseline for the entire distribution (i.e., everything would be built to
>> level A or maybe even level C), and these libraries would then be
>> installed in the default location.
>>
>> I'll be glad if I can get any feedback on this proposal.  I plan to turn
>> it into a merge request for the x86-64 psABI document eventually.

> Sounds good, though if I could dream I would also love a partial
> replacement option. So that you could have a generic x86-64 binary
> that only had some AVX2 optimized replacement functions in a
> supplementary library.
>
> Perhaps implemented by marking the library as a partial replacement, so
> the dynamic linker would also load the base or lower libraries except
> for functions already resolved.

I think you can do something like it today, at least from the glibc
dynamic loader perspective.  Programs link against the soname of the
optimized shared object (which can be empty), and that shared object
depends on the object with the fallback implementation.  A special
link-only shared object containing just the ABI under the front soname
(that of the optimized library) would be used via a linker object, so
that it is not possible to accidentally link against the wrong soname.

For non-versioned symbols, this setup has worked since forever.  For
versioned symbols, delegating from the optimized to the unoptimized
library needs at least glibc 2.30, with commit f0b2132b35248c1f4a80
("ld.so: Support moving versioned symbols between sonames [BZ #24741]"),
although some of us have backported this commit into earlier releases.

Where this falls flat is support for LTO and
-fno-semantic-interposition.  Some care is needed to make precisely the
right set of symbols interposable.  But to be honest, I'm not sure if this
entire mechanism is a big improvement over function multi-versioning.

Thanks,
Florian


Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Sourceware - libc-alpha mailing list
* Richard Biener:

>> Looks good.  I like it.
>
> Likewise.  Btw, did you check that VIA family chips slot into Level A
> at least?

Those seem to lack SSE4.2, so they land in the baseline.

> Where do AMD bdverN slot in?

bdver1 to bdver3 (as defined by GCC) should land in Level B (so Level A
if that is dropped).  bdver4 and znver1 (and later) should land in
Level C.

>>  My only concerns are
>>
>> 1. Names like “x86-100”, “x86-101”, what features do they support?
>
> Indeed I didn't get the -100, -101 part.  On the GCC side I'd have
> suggested -march=generic-{A,B,C,D} implying the respective
> -mtune.

With literal A, B, C, D, or are they just placeholders?  If not literal
levels, then what should we use there?

I like the simplicity of numbers.  I used letters in the proposal to
avoid confusion if we alter the proposal by dropping levels, shifting
the meaning of those that come later.  I expect to switch back to
numbers again for the final version.

> Do the patches end up annotating ELF binaries with the architecture
> level and does ld.so check that info?

This is a separate feature that H.J. has been working on.

> For example IIRC there's a penalty to switch between VEX and
> not VEX encoded instructions so even on AVX capable hardware
> it might be profitable to use non-AVX libraries if the program is
> using only architecture level A?

But this is impossible to know in general.  It may also be possible that
the library contains an inner loop that can be nicely vectorized with
AVX instructions, but not with SSE4.2 instructions and earlier.  Then
preferring the non-AVX version would be a mistake.

Regarding the transition penalty, I believe this is mostly addressed by
those VZEROUPPER instructions?  I've already explained why I think those
aren't a viable optimization target, given the current calling
convention.

My glibc patches already provide a way to mask subdirectories which
would otherwise be selected, so manual optimization is still possible.

> On that side, does architecture level B+ suggest using VEX encoding
> everywhere?  It would be indeed nice to have the architecture levels
> documented in the psABI.

I think this falls under optimization, which I really did not want to
discuss here.

If there is a plan to change/amend the calling convention and some of
the levels should refer to that, it's a different matter, of course.
(glibc can only give you four callee-saved 256-bit wide registers
easily, though, more would need close cooperation with GCC.)

The new glibc-hwcaps scheme in glibc scales a bit better than the old
one, so we do not have to settle this immediately and could add
additional subdirectories for objects that follow new calling convention
requirements.

>> 2. I have a library with AVX2 and FMA, which directory should it go into?
>
> Eventually GCC/gas can annotate objects with the lowest architecture
> level that is applicable?

H.J. has patches for ELF program properties.  I think
GNU_PROPERTY_X86_ISA_1_NEEDED would convey this information.  This
proposal and the glibc patches are independent of that.

If that function ever gets deployed, I plan to add those notes to
ld.so.cache, so that ld.so can select shared objects based on them (or
any allocated ELF note, really).  Efficient LD_LIBRARY_PATH support is
not possible, I think, so those designated glibc-hwcaps subdirectories
still have a place.

Thanks,
Florian


Re: New x86-64 micro-architecture levels

Jan Beulich-2
On 13.07.2020 09:40, Florian Weimer wrote:
> * Richard Biener:
>>> 2. I have a library with AVX2 and FMA, which directory should it go?
>>
>> Eventually GCC/gas can annotate objects with the lowest architecture
>> level that is applicable?
>
> H.J. has patches for ELF program properties.  I think
> GNU_PROPERTY_X86_ISA_1_NEEDED would convey this information.  This
> proposal and the glibc patches are independent of that.

From (partly just halfway) recent discussions with H.J. I gained
the understanding that the piece we're aiming at getting to work
properly is the recording of GNU_PROPERTY_X86_FEATURE_2_*, not
so much GNU_PROPERTY_X86_ISA_1_*. If the ISA one is to be used as
a basis here, a lot of new flags will need adding (and properly
setting) first, I think.

Jan

Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Joseph Myers
* Joseph Myers:

> On Fri, 10 Jul 2020, Florian Weimer via Gcc wrote:
>
>> * Level A
>>
>> CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
>>
>> This is one step above the K8 baseline and corresponds to a mainline CPU
>> model ca. 2008 to 2011.  It is also implemented by recent-ish
>> generations of Intel Atom server CPUs (although I haven't tested the
>> latest version).  A 32-bit variant would have to list many additional
>> CPU features here.
>
> FWIW, this is also the limit of what can be run under QEMU emulation, as
> QEMU lacks support for AVX and newer instruction set features.

Oh, I had forgotten about that.  I should have Cc:ed the QEMU folks as well.
We'll need to make sure that we have matching CPU models in
QEMU/libvirt, even for the levels that do not have TCG support.

valgrind is another consumer, but in my tests, it was mostly okay with
AVX2 code (but that was without auto-vectorization).  AVX-512 is a
different matter, but that is also much further out.

> On the other hand, virtual machines seem liable to report something closer
> to the K8 baseline to the guest OS, missing the level A features, even
> when the underlying hardware supports everything in level B or level C.

They do this to support migration.  I suspect that in many cases,
those are just configuration errors.  That's why I want at least one
major distribution to switch to Level C as the baseline, to clean the
pipes.  Then even those distributions that depend on run-time selection
for performance-critical code will benefit. 8-/
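
Whether a guest actually exposes the Level A features can be checked from inside the guest.  A minimal sketch, assuming Linux and the flag names it prints in /proc/cpuinfo ("pni" for SSE3, "cx16" for CMPXCHG16B, "lahf_lm" for LAHF/SAHF); the helper names are hypothetical:

```python
# Level A features from the proposal, keyed by the flag names Linux
# prints in /proc/cpuinfo.
LEVEL_A_FLAGS = {
    "cx16": "CMPXCHG16B",
    "lahf_lm": "LAHF/SAHF",
    "popcnt": "POPCNT",
    "pni": "SSE3",
    "sse4_1": "SSE4.1",
    "sse4_2": "SSE4.2",
    "ssse3": "SSSE3",
}


def missing_level_a(flags):
    """Return the Level A features absent from a set of cpuinfo flags."""
    return sorted(name for flag, name in LEVEL_A_FLAGS.items()
                  if flag not in flags)


def read_cpu_flags(path="/proc/cpuinfo"):
    """Collect the flags of the first CPU listed in a cpuinfo-style file."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


# A hypothetical guest whose CPU model hides most Level A features.
restricted_guest = {"fpu", "sse", "sse2", "pni", "cx16"}
print(missing_level_a(restricted_guest))
```

On a real Linux guest, missing_level_a(read_cpu_flags()) returns an empty list when all Level A features are visible.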

Thanks,
Florian


Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Sourceware - libc-alpha mailing list
On Mon, Jul 13, 2020 at 9:40 AM Florian Weimer <[hidden email]> wrote:

>
> * Richard Biener:
>
> >> Looks good.  I like it.
> >
> > Likewise.  Btw, did you check that VIA family chips slot into Level A
> > at least?
>
> Those seem to lack SSE4.2, so they land in the baseline.
>
> > Where do AMD bdverN slot in?
>
> bdver1 to bdver3 (as defined by GCC) should land in Level B (so Level A
> if that is dropped).  bdver4 and znver1 (and later) should land in
> Level C.
>
> >>  My only concerns are
> >>
> >> 1. Names like “x86-100”, “x86-101”, what features do they support?
> >
> > Indeed I didn't get the -100, -101 part.  On the GCC side I'd have
> > suggested -march=generic-{A,B,C,D} implying the respective
> > -mtune.
>
> With literal A, B, C, D, or are they just placeholders?  If not literal
> levels, then what we should use there?
>
> I like the simplicity of numbers.  I used letters in the proposal to
> avoid confusion if we alter the proposal by dropping or levels, shifting
> the meaning of those that come later.  I expect to switch back to
> numbers again for the final version.

They are indeed placeholders though I somehow prefer letters to
numbers.  But this is really bike-shedding territory.  Good documentation
on the tools side will be more important, as well as consistent spelling
between tool sets, possibly driven by a good choice from within the
psABI document.

Richard.

Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Sourceware - libc-alpha mailing list
On Sun, Jul 12, 2020 at 11:49 PM Florian Weimer <[hidden email]> wrote:
>
> * H. J. Lu:
>
> > Looks good.  I like it.
>
> Thanks.  What do you think about Level B?  Should we keep it?

Please drop Level B.

> > My only concerns are
> >
> > 1. Names like “x86-100”, “x86-101”, what features do they support?
>
> I think we can add more diagnostic output to ld.so --help.  My patch
> does not show individual CPU flags, but I agree this could be useful.
> (It's not needed for the legacy HWCAP subdirectories because in general,
> those are named & defined by the kernel, not by individually named CPU
> feature flags.)
>
> > 2. I have a library with AVX2 and FMA, which directory should it go?
> >
> > Can we pass such info to ld.so and ld.so prints out the best directory
> > name?
>
> I think this would require generating matching GNU property notes (list
> the CPU features required by the binary).  Once we have that, we can add

I have turned on -mx86-used-note=yes by default for binutils 2.36.
I will add more ISA bits after we determine which ISAs will be used.
But compilers need to generate the GNU_PROPERTY_X86_ISA_1_NEEDED
property.

> something to binutils or indeed ld.so to analyze them and print the
> recommended directory.  But I think this is something that could come
> later.
>
> We can also write a GCC header which looks at macros such as __AVX2__
> and prints a #warning with the recommended directory name.  Checking for
> excess flags will be tricky in this context, though, and if we miss
> something, a wrong recommendation will be the result.
>
> Thanks,
> Florian
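
The "which directory should it go?" question can be pictured as a subset test against cumulative feature sets.  A minimal sketch: Level A's list is the one from the proposal, while the B, C and D entries are abbreviated assumptions for illustration and would need checking against the final definitions.  The function name is hypothetical.

```python
# Features each level adds on top of the previous one.
LEVELS = [
    ("A", {"cmpxchg16b", "lahf-sahf", "popcnt", "sse3",
           "sse4.1", "sse4.2", "ssse3"}),
    ("B", {"avx"}),             # assumption, abbreviated
    ("C", {"avx2", "fma"}),     # assumption, abbreviated
    ("D", {"avx512f"}),         # assumption, abbreviated
]
BASELINE = {"sse", "sse2"}      # abbreviated baseline


def recommend_level(needed):
    """Return the lowest level covering `needed`, "baseline", or None."""
    needed = set(needed)
    available = set(BASELINE)
    if needed <= available:
        return "baseline"
    for name, added in LEVELS:
        available |= added
        if needed <= available:
            return name
    # Something required lies outside every level: no directory is safe.
    return None


print(recommend_level({"avx2", "fma"}))
print(recommend_level({"avx2", "fma4"}))
```

The second call shows the excess-flag problem: a library that also needs a feature outside every level (FMA4 here) must not be given a recommendation at all.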


--
H.J.

Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Jan Beulich-2
On Mon, Jul 13, 2020 at 12:48 AM Jan Beulich <[hidden email]> wrote:

>
> On 13.07.2020 09:40, Florian Weimer wrote:
> > * Richard Biener:
> >>> 2. I have a library with AVX2 and FMA, which directory should it go?
> >>
> >> Eventually GCC/gas can annotate objects with the lowest architecture
> >> level that is applicable?
> >
> > H.J. has patches for ELF program properties.  I think
> > GNU_PROPERTY_X86_ISA_1_NEEDED would convey this information.  This
> > proposal and the glibc patches are independent of that.
>
> From (partly just halfway) recent discussions with H.J. I gained
> the understanding that the piece we're aiming at getting to work
> properly is the recording of GNU_PROPERTY_X86_FEATURE_2_*, not
> so much GNU_PROPERTY_X86_ISA_1_*. If the ISA one is to be used as
> a basis here, a lot of new flags will need adding (and properly
> setting) first, I think.
>

We can update GNU_PROPERTY_X86_ISA_1_* as needed.

--
H.J.

Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
On Mon, Jul 13, 2020 at 06:31:31AM -0700, H.J. Lu via Gcc wrote:

> > > H.J. has patches for ELF program properties.  I think
> > > GNU_PROPERTY_X86_ISA_1_NEEDED would convey this information.  This
> > > proposal and the glibc patches are independent of that.
> >
> > From (partly just halfway) recent discussions with H.J. I gained
> > the understanding that the piece we're aiming at getting to work
> > properly is the recording of GNU_PROPERTY_X86_FEATURE_2_*, not
> > so much GNU_PROPERTY_X86_ISA_1_*. If the ISA one is to be used as
> > a basis here, a lot of new flags will need adding (and properly
> > setting) first, I think.
> >
>
> We can update GNU_PROPERTY_X86_ISA_1_* as needed.

I am not really sure such properties are a good idea; they will be a
maintainability nightmare (as they are on other OSes like Solaris).
Think about function multiversioning, the target attribute applied to just
some functions, #pragma omp declare simd.  How do you differentiate between
carefully written code that handles cpuid detection itself or uses compiler
support for that, where we do not want to mark the objects in any way because
they should work just fine even on K8, and cases where users do want
such marking?

E.g. look for -mclear-hwcap stuff needed for Solaris because of that.

        Jakub


Re: New x86-64 micro-architecture levels

Mark Wielaard
In reply to this post by Sourceware - libc-alpha mailing list
Hi Florian,

I understand you want to discuss the x86_64 micro-architecture levels
only in this thread, but it would be nice to have a similar discussion
for other architectures.

One thing that wasn't clear to me from this proposal is how the glibc
dynamic loader checks for the CPU feature flags. This is important for
valgrind since it can communicate those through different means: cpuid
interception, auxv AT_HWCAP/AT_HWCAP2 interception (but not AT_PLATFORM
at the moment), and of course we can generate SIGILL for unsupported
instructions. We currently don't intercept /proc/cpuinfo (but could).

I think it is important to be precise here, because in the past this
has sometimes caused confusion, for example about how to check correctly
for avx, lzcnt, or fma[4] support.

Thanks,

Mark

P.S. I don't particularly like the numbered names, but well, bike-shed...

Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
On Wed, Jul 15, 2020 at 7:38 AM Mark Wielaard <[hidden email]> wrote:

>
> Hi Florian,
>
> I understand you want to discuss the x86_64 micro-architecture levels
> only in this thread, but it would be nice to have a similar discussion
> for other architectures.
>
> One thing that wasn't clear to me from this proposal is how the glibc
> dynamic loader checks for the CPU feature flags. This is important for
> valgrind since it can communicate those through different means. cpuid
> interception, auxv AT_HWCAP/AT_HWCAP2 interception (but not AT_PLATFORM
> at the moment) and of course we can generate SIGILL for unsupported
> instructions. We currently don't intercept /proc/cpuinfo (but could).

In a library, we can use <sys/platform/x86.h>:

https://sourceware.org/pipermail/libc-alpha/2020-June/115546.html

In GCC, we can use __builtin_cpu_supports.

<sys/platform/x86.h> supports all features and __builtin_cpu_supports in
GCC 11 supports all features which GCC has codegen for.

> I think it is important to be precise here, because in the past this
> has sometimes caused confusion. For example for how to check correctly
> for avx, lzcnt, or fma[4] support.
>
> Thanks,
>
> Mark
>
> P.S. I don't particularly like the numbered names, but well, bike-shed...



--
H.J.

Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
In reply to this post by Mark Wielaard
* Mark Wielaard:

> One thing that wasn't clear to me from this proposal is how the glibc
> dynamic loader checks for the CPU feature flags. This is important for
> valgrind since it can communicate those through different means. cpuid
> interception, auxv AT_HWCAP/AT_HWCAP2 interception (but not AT_PLATFORM
> at the moment) and of course we can generate SIGILL for unsupported
> instructions. We currently don't intercept /proc/cpuinfo (but could).

glibc uses CPUID in combination with XGETBV.  There is also a masking
feature which I have not reviewed, but given that it only takes features
away, I don't think it matters to valgrind.
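
Why XGETBV matters alongside CPUID: the CPUID bit only says the CPU implements AVX, while XCR0 (read via XGETBV) says whether the OS has enabled the register state.  A minimal sketch of that decision as pure logic, assuming the caller has already obtained the CPUID bits and the XCR0 value; the function names are hypothetical and this is not glibc's code:

```python
XCR0_SSE = 1 << 1          # XMM state enabled by the OS
XCR0_AVX = 1 << 2          # upper YMM halves enabled by the OS
XCR0_AVX512 = 0b111 << 5   # opmask, ZMM_Hi256 and Hi16_ZMM state


def avx_usable(cpuid_avx, cpuid_osxsave, xcr0):
    """AVX needs the CPUID bit, OSXSAVE, and XMM+YMM state set in XCR0."""
    wanted = XCR0_SSE | XCR0_AVX
    return bool(cpuid_avx and cpuid_osxsave and (xcr0 & wanted) == wanted)


def avx512f_usable(cpuid_avx512f, cpuid_osxsave, xcr0):
    """AVX-512F additionally needs the three AVX-512 state bits in XCR0."""
    wanted = XCR0_SSE | XCR0_AVX | XCR0_AVX512
    return bool(cpuid_avx512f and cpuid_osxsave
                and (xcr0 & wanted) == wanted)


# CPU implements AVX, but the OS (or a hypervisor) left YMM state disabled.
print(avx_usable(True, True, 0b011))
# Same CPU with x87, XMM and YMM state enabled.
print(avx_usable(True, True, 0b111))
```

A tool that intercepts only CPUID but not XGETBV (or the reverse) can therefore give a program an inconsistent picture of what is usable.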

Thanks,
Florian


RE: New x86-64 micro-architecture levels

Mallappa, Premachandra
In reply to this post by Sourceware - libc-alpha mailing list
[AMD Public Use]

Hi Florian,

> I'm including a proposal for the levels below.  I use single letters for them, but I expect that the concrete implementation of this proposal will use
> names like “x86-100”, “x86-101”, like in the glibc patch referenced above.  (But we can discuss other approaches.)

Personally I am not a big fan of this, for 2 reasons
1. uses just x86 in name on x86_64 as well
2. 100/101 not very intuitive


> * Level A
...
> * Level B
> This step is so small that it probably can be dropped, unless the benefits from using VEX encoding are truly significant.

Yes, Agree, the delta is too small, can be clubbed into A or C.

> * Level C
> * Level D

The others are in line with what we expect as a logical grouping.

As you mentioned, it is not easy to tackle this.
We would also like to have dynamic loader support for "zen" / "zen2" as a version of "Level D" that takes preference over Level D,
which may have super-optimized libraries from AMD or other vendors.
These libraries are expected to be optimized according to micro-architectural details, not just ISA.

Probably we can discuss this on the hwcaps thread.

-Prem

Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
* Premachandra Mallappa:

> [AMD Public Use]
>
> Hi Florian,
>
>> I'm including a proposal for the levels below.  I use single letters for them, but I expect that the concrete implementation of this proposal will use
>> names like “x86-100”, “x86-101”, like in the glibc patch referenced above.  (But we can discuss other approaches.)
>
> Personally I am not a big fan of this, for 2 reasons
> 1. uses just x86 in name on x86_64 as well

That's deliberate, so that we can use the same x86-* names for 32-bit
library selection (once we define matching micro-architecture levels
there).

GCC has -m32 -march=x86-64 for K8 without 3DNow! (essentially the shared
x86-64/EM64T baseline), but I find this a bit confusing.

> 2. 100/101 not very intuitive

Any suggestions?  The advantage is that these numbers show a strong
preference ordering.  They also do not make false suggestions about feature
sets: if we named Level C "x86-avx2", it would still be wrong for glibc
to load libraries found in that directory just because a system has AVX2
support, because the libraries might also need FMA (based on the Level C
definition).  On the GCC side, it avoids confusion between -mavx2 and
-march=x86-avx2.

If numbers are out, what should we use instead?
x86-sse4, x86-avx2, x86-avx512?  Would that work?

>> * Level A
> ...
>> * Level B
>> This step is so small that it probably can be dropped, unless the benefits from using VEX encoding are truly significant.
>
> Yes, Agree, the delta is too small, can be clubbed into A or C.

Let's merge Level B into Level C then?

>> * Level C
>> * Level D
>
> The others are in line with what we expect as a logical grouping.

Thanks.

> Also we would also like to have dynamic loader support for "zen" /
> "zen2" as a version of "Level D" and takes preference over Level D,
> which may have super-optimized libraries from AMD or other vendors.

*That* shouldn't be too hard to implement if we can nail down the
selection criteria.  Let's call this Zen-specific Level C x86-zen-avx2
for the sake of exposition.

What's going to be difficult is the choice for a hypothetical Zen
successor that's compatible feature-flag-wise with Level D.

Basically, there are two choices here:

  * Level D wins because it's the more powerful ISA.
  * x86-zen-avx2 wins because it has the Zen architecture optimizations.

There's also a related issue with Level C vs x86-zen-avx2 depending on
how we implement the Zen detection for AMD family numbers in the glibc
dynamic linker.  What do I mean by this?  Suppose glibc detects that this is
a Level C capable Zen-type CPU, but it's not one of the family/model numbers
that were hard-coded into the glibc sources.  What should we do then?  Should
we still prefer the x86-zen-avx2 library over the Level C library?
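
The two choices above, and the unknown-model case, can be made concrete with a small sketch.  It assumes the directory names "C", "D" and "x86-zen-avx2" as stand-ins and a boolean for "family/model is in the hard-coded list"; the function and policy names are hypothetical:

```python
def pick_directory(cpu_levels, is_known_zen, available, policy):
    """Choose one library directory under a given preference policy.

    cpu_levels:   generic levels the CPU satisfies, e.g. {"C", "D"}.
    is_known_zen: True if the family/model matched a hard-coded Zen entry.
    available:    directory names that exist on disk.
    policy:       "isa-first" or "vendor-first".
    """
    generic = [d for d in ("D", "C") if d in cpu_levels and d in available]
    zen = []
    if is_known_zen and "C" in cpu_levels and "x86-zen-avx2" in available:
        zen = ["x86-zen-avx2"]
    if policy == "vendor-first":
        order = zen + generic
    elif policy == "isa-first":
        # Level D outranks the Zen directory; plain Level C ranks below it.
        order = ([d for d in generic if d == "D"] + zen
                 + [d for d in generic if d == "C"])
    else:
        raise ValueError(f"unknown policy: {policy}")
    return order[0] if order else "baseline"


dirs = {"C", "D", "x86-zen-avx2"}
# Hypothetical Zen successor that also satisfies Level D.
print(pick_directory({"C", "D"}, True, dirs, "isa-first"))
print(pick_directory({"C", "D"}, True, dirs, "vendor-first"))
# Zen-type CPU whose model number is not in the hard-coded list.
print(pick_directory({"C"}, False, dirs, "vendor-first"))
```

In this sketch an unrecognized model silently falls back to the generic Level C directory, which is safe but is exactly the lost optimization the question is about.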

> These libraries are expected to be optimized according to
> micro-architectural details, not just ISA.

If it's supposed to be generally useful, we really need to document the
selection criteria for the subdirectory and make sure that it matches
what these libraries actually require at run time in terms of ISA.

I want to avoid two things here specifically: A hardware upgrade results
in crashes because we incorrectly load an incompatible library.  And, if
possible: A hardware upgrade (or kernel/hypervisor upgrade that exposes
more of the actual hardware) causes us to drop optimizations, so that
users experience a performance regression.

With the levels I proposed, these aspects are covered.  But if we start
to create vendor-specific forks in the feature progression, things get
complicated.

Do you think we need to figure this out in this iteration?  If yes, then
I really need a semi-formal description of the selection criteria for
this x86-zen-avx2 directory, so that I can pass it along with my psABI
proposal.

Thanks,
Florian


Re: New x86-64 micro-architecture levels

Sourceware - libc-alpha mailing list
I fully agree that these names (100/101, A/B/C/D) are not very intuitive.  I
recommend using ISA tags by year (e.g. x64_2010, x64_2014), like
Python's platform tags (e.g. manylinux2010, manylinux2014).