i386: Lazy binding trampoline and vector register usage

i386: Lazy binding trampoline and vector register usage

Florian Weimer
We have this in sysdeps/i386/Makefile:

# Make sure no code in ld.so uses mm/xmm/ymm/zmm registers on i386 since
# the first 3 mm/xmm/ymm/zmm registers are used to pass vector parameters
# which must be preserved.
# With SSE disabled, ensure -fpmath is not set to use sse either.
rtld-CFLAGS += -mno-sse -mno-mmx -mfpmath=387
ifeq ($(subdir),elf)
CFLAGS-.os += $(if $(filter $(@F),$(patsubst %,%.os,$(all-rtld-routines))),\
                   $(rtld-CFLAGS))

tests-special += $(objpfx)tst-ld-sse-use.out
$(objpfx)tst-ld-sse-use.out: ../sysdeps/i386/tst-ld-sse-use.sh $(objpfx)ld.so
        @echo "Checking ld.so for SSE register use.  This will take a few seconds..."
        $(BASH) $< $(objpfx) '$(NM)' '$(OBJDUMP)' '$(READELF)' > $@; \
        $(evaluate-test)
else
CFLAGS-.os += $(if $(filter rtld-%.os,$(@F)), $(rtld-CFLAGS))
endif

The idea is that we do not need to save and restore vector registers in
the trampoline (or align the stack) if we compile ld.so in such a way
that only general registers are used.  But that does not actually work
in all cases because lazy binding can call malloc, which lives in
libc.so or might even be interposed, and is thus free to use vector
registers.
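
A minimal sketch of the kind of call that is at risk, assuming GCC or
Clang vector extensions and an i386 build with SSE enabled. The function
name scale4 is hypothetical and does not come from glibc:

```c
#include <assert.h>

/* Four packed floats.  With SSE enabled the compiler keeps a value of
   this type in a single XMM register.  */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical library function.  On i386 built with SSE, v is passed
   in %xmm0.  If the first call goes through a lazily bound PLT entry,
   the trampoline and _dl_fixup run before this body does, so whatever
   they call must leave %xmm0 untouched.  */
v4sf scale4(v4sf v, float k)
{
    v4sf kk = { k, k, k, k };
    return v * kk;   /* element-wise multiply */
}
```

If something on the fixup path (malloc, say) clobbered %xmm0 during the
first call, scale4 would see a corrupted v on that call only, which is
why such a bug would be hard to notice in testing.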

What should we do about this?  Calling malloc from _dl_fixup is unsafe
for other reasons because lazy binding can happen in signal handlers, so
maybe this would be fixed if we switched to a non-interposable
async-signal-safe allocator?

(I found this by code inspection.  I have not seen actual crashes/wrong
results at run time.)

Thanks,
Florian


Re: i386: Lazy binding trampoline and vector register usage

H.J. Lu
On Wed, Dec 18, 2019 at 2:22 AM Florian Weimer wrote:

>
> We have this in sysdeps/i386/Makefile:
>
> # Make sure no code in ld.so uses mm/xmm/ymm/zmm registers on i386 since
> # the first 3 mm/xmm/ymm/zmm registers are used to pass vector parameters
> # which must be preserved.
> # With SSE disabled, ensure -fpmath is not set to use sse either.
> rtld-CFLAGS += -mno-sse -mno-mmx -mfpmath=387
> ifeq ($(subdir),elf)
> CFLAGS-.os += $(if $(filter $(@F),$(patsubst %,%.os,$(all-rtld-routines))),\
>                    $(rtld-CFLAGS))
>
> tests-special += $(objpfx)tst-ld-sse-use.out
> $(objpfx)tst-ld-sse-use.out: ../sysdeps/i386/tst-ld-sse-use.sh $(objpfx)ld.so
>         @echo "Checking ld.so for SSE register use.  This will take a few seconds..."
>         $(BASH) $< $(objpfx) '$(NM)' '$(OBJDUMP)' '$(READELF)' > $@; \
>         $(evaluate-test)
> else
> CFLAGS-.os += $(if $(filter rtld-%.os,$(@F)), $(rtld-CFLAGS))
> endif
>
> The idea is that we do not need to save and restore vector registers in
> the trampoline (or align the stack) if we compile ld.so in such a way
> that only general registers are used.  But that does not actually work
> in all cases because lazy binding can call malloc, which lives in
> libc.so or might even be interposed, and is thus free to use vector
> registers.
>
> What should we do about this?  Calling malloc from _dl_fixup is unsafe
> for other reasons because lazy binding can happen in signal handlers, so
> maybe this would be fixed if we switched to a non-interposable
> async-signal-safe allocator?

Yes, we can do it for i386.

> (I found this by code inspection.  I have not seen actual crashes/wrong
> results at run time.)

Thanks.

--
H.J.

Re: i386: Lazy binding trampoline and vector register usage

Szabolcs Nagy
On 18/12/2019 10:22, Florian Weimer wrote:

> We have this in sysdeps/i386/Makefile:
>
> # Make sure no code in ld.so uses mm/xmm/ymm/zmm registers on i386 since
> # the first 3 mm/xmm/ymm/zmm registers are used to pass vector parameters
> # which must be preserved.
> # With SSE disabled, ensure -fpmath is not set to use sse either.
> rtld-CFLAGS += -mno-sse -mno-mmx -mfpmath=387
> ifeq ($(subdir),elf)
> CFLAGS-.os += $(if $(filter $(@F),$(patsubst %,%.os,$(all-rtld-routines))),\
>   $(rtld-CFLAGS))
>
> tests-special += $(objpfx)tst-ld-sse-use.out
> $(objpfx)tst-ld-sse-use.out: ../sysdeps/i386/tst-ld-sse-use.sh $(objpfx)ld.so
> @echo "Checking ld.so for SSE register use.  This will take a few seconds..."
> $(BASH) $< $(objpfx) '$(NM)' '$(OBJDUMP)' '$(READELF)' > $@; \
> $(evaluate-test)
> else
> CFLAGS-.os += $(if $(filter rtld-%.os,$(@F)), $(rtld-CFLAGS))
> endif
>
> The idea is that we do not need to save and restore vector registers in
> the trampoline (or align the stack) if we compile ld.so in such a way
> that only general registers are used.  But that does not actually work
> in all cases because lazy binding can call malloc, which lives in
> libc.so or might even be interposed, and is thus free to use vector
> registers.
>
> What should we do about this?  Calling malloc from _dl_fixup is unsafe
> for other reasons because lazy binding can happen in signal handlers, so
> maybe this would be fixed if we switched to a non-interposable
> async-signal-safe allocator?

note that ifunc resolvers can also run during lazy binding
and those can execute arbitrary user code (even if the
allocator issue is fixed).
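
A minimal sketch of why a resolver counts as user code, assuming GCC or
Clang on an ELF target with glibc. The names add, add_generic and
resolve_add are hypothetical:

```c
#include <assert.h>

/* One candidate implementation; a real resolver would usually pick
   among several based on CPU features.  */
static int add_generic(int a, int b)
{
    return a + b;
}

/* The resolver is ordinary user-written C.  Nothing stops it from
   using vector registers or calling other functions.  In this
   single-file form it runs while startup relocations are applied;
   when the symbol lives in a lazily bound shared object, it runs
   from the fixup path on the first call instead.  */
static int (*resolve_add(void))(int, int)
{
    return add_generic;
}

/* Calls to add() are bound to whatever resolve_add() returns.  */
int add(int a, int b) __attribute__((ifunc("resolve_add")));
```

Because the dynamic loader has no control over what resolve_add does,
building ld.so with -mno-sse does not protect the caller's vector
arguments on this path.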

Re: i386: Lazy binding trampoline and vector register usage

Carlos O'Donell
On 1/9/20 5:49 AM, Szabolcs Nagy wrote:

> On 18/12/2019 10:22, Florian Weimer wrote:
>> We have this in sysdeps/i386/Makefile:
>>
>> # Make sure no code in ld.so uses mm/xmm/ymm/zmm registers on i386 since
>> # the first 3 mm/xmm/ymm/zmm registers are used to pass vector parameters
>> # which must be preserved.
>> # With SSE disabled, ensure -fpmath is not set to use sse either.
>> rtld-CFLAGS += -mno-sse -mno-mmx -mfpmath=387
>> ifeq ($(subdir),elf)
>> CFLAGS-.os += $(if $(filter $(@F),$(patsubst %,%.os,$(all-rtld-routines))),\
>>   $(rtld-CFLAGS))
>>
>> tests-special += $(objpfx)tst-ld-sse-use.out
>> $(objpfx)tst-ld-sse-use.out: ../sysdeps/i386/tst-ld-sse-use.sh $(objpfx)ld.so
>> @echo "Checking ld.so for SSE register use.  This will take a few seconds..."
>> $(BASH) $< $(objpfx) '$(NM)' '$(OBJDUMP)' '$(READELF)' > $@; \
>> $(evaluate-test)
>> else
>> CFLAGS-.os += $(if $(filter rtld-%.os,$(@F)), $(rtld-CFLAGS))
>> endif
>>
>> The idea is that we do not need to save and restore vector registers in
>> the trampoline (or align the stack) if we compile ld.so in such a way
>> that only general registers are used.  But that does not actually work
>> in all cases because lazy binding can call malloc, which lives in
>> libc.so or might even be interposed, and is thus free to use vector
>> registers.
>>
>> What should we do about this?  Calling malloc from _dl_fixup is unsafe
>> for other reasons because lazy binding can happen in signal handlers, so
>> maybe this would be fixed if we switched to a non-interposable
>> async-signal-safe allocator?
>
> note that ifunc resolvers can also run during lazy binding
> and those can execute arbitrary user code (even if the
> allocator issue is fixed).

(1) ifunc resolvers

I think ifunc resolvers are a unique problem that needs to be handled by
specific solutions for ifunc.

I think the resolvers should not run during lazy binding. They are effectively
user code and should be handled more like initializers, and that means processing
them up-front to ensure dlopen completes successfully.

(2) Non-interposable AS-safe allocator.

The history behind having a non-interposable AS-safe allocator looks like this:

- Google proposed and wrote patches to create a no-interpose AS-safe allocator.
  https://www.sourceware.org/ml/libc-alpha/2013-09/msg00721.html

- We accepted the patches and they fixed the bug with lazy TLS allocation on
  first use: touching TLS for the first time inside a signal handler could
  call calloc from that handler, which is not async-signal-safe.

- We subsequently had reports of tooling (one of the sanitizers, I can't
  remember which) losing track of TLS entirely because of this new internal
  allocator.

- We reverted the patches.

In the end we accepted that for TLS the allocation should just happen upfront
at dlopen time.

I'm not entirely sold on the idea of having to do all the allocations upfront
and I *like* the idea of a non-interposable AS-safe allocator for ld/libc's
own internal uses that a user never sees and can never observe. So for example
if we have internal bookkeeping to allocate for TLS then we can use that
allocator to create the details of the bookkeeping. However, this must be
balanced against the user's desire to control their own allocation strategy.
Therefore they must be able to have some control over larger allocations
and their placement via the malloc family APIs where possible.

In summary:
- We need to keep using malloc for users to be able to interpose.
- For internal bookkeeping we could use a non-interposable AS-safe allocator.
- I don't know if we can solve our current problems entirely with a
  non-interposable AS-safe allocator.
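
As a rough illustration of the second point, here is a minimal sketch of
an internal bump allocator, assuming C11 atomics that are lock-free for
size_t on the target. It is not the allocator from the 2013 patches; the
names internal_alloc, arena and arena_used are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE  4096u
#define ARENA_ALIGN 16u

/* Fixed backing store; nothing here ever calls malloc.  */
static _Alignas(16) unsigned char arena[ARENA_SIZE];
static atomic_size_t arena_used;

/* Static linkage means no other object can interpose this function.
   The bump pointer is advanced with a compare-and-swap loop rather
   than a lock, so a signal handler that interrupts one call and makes
   another cannot deadlock.  Memory is never freed.  */
static void *internal_alloc(size_t size)
{
    if (size == 0 || size > ARENA_SIZE)
        return NULL;

    /* Round the request up to the arena alignment.  */
    size_t need = (size + ARENA_ALIGN - 1) & ~(size_t)(ARENA_ALIGN - 1);
    size_t old = atomic_load(&arena_used);

    do {
        if (need > ARENA_SIZE - old)
            return NULL;            /* arena exhausted */
    } while (!atomic_compare_exchange_weak(&arena_used, &old, old + need));

    return arena + old;
}
```

A fixed arena like this only suits small bookkeeping records. It does
not address the concern above about large allocations, which users may
still want routed through the malloc family so they can control
placement.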

(3) Calling malloc API functions from _dl_fixup is wrong.

In the case of _dl_fixup we may need to call add_dependency and malloc new
link maps. This is all wrong for lazy binding. This is more work that needs to
be moved to *before* we commit to ever running anything in that library.

The new link maps are part of our scope tracking and with lots of DSOs this could
be quite a bit of memory hidden inside a local allocator or allocated up-front.

--
Cheers,
Carlos.