Invalid program counters and unwinding

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Invalid program counters and unwinding

Florian Weimer-5
I'm looking at ways to speed up _Unwind_Find_FDE when libgcc is running
on top of glibc.  I have something (at the design level, with some of
the code written) which allows me to get a pointer to the
PT_GNU_EH_FRAME segment in memory in a lock-free fashion (so it would
also be async-signal safe).

This part works also when the program counter used in the search is
invalid and does not point to within a loaded object, even in the case
of concurrent dlopen/dlclose.

However, it's still necessary to read the PT_GNU_EH_FRAME data itself,
and if _Unwind_Find_FDE is not a valid program counter found on the
stack (with in a caller, where unmapping it with dlclose would be
invalid), it could happen that it is a random address in *another*,
unrelated object, which then gets dlclose'd (which is valid).

The current glibc-based implementation in libgcc calls dl_iterate_phdr,
which acquires a lock blocking dlclose for the entire duration of the
iteration.  But I think this still doesn't support arbitrary, random PC
values because in the worst case, the PC value looks valid, we find some
unrelated FDE data with an associated personality routine, and end up
calling that, with disastrous consequences.

So it looks to me that the caller of _Unwind_Find_FDE needs to ensure
that the PC is a valid element of the call stack.  Is this a correct
assumption?

I have some ideas how make reading the PT_GNU_EH_FRAME data safe, but
the question is whether we actually need that.

Previous discussions:

https://gcc.gnu.org/ml/gcc/2013-05/msg00253.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744
https://sourceware.org/ml/libc-alpha/2016-07/msg00613.html
   (patch with a spread lock, still not async-signal-safe)

Thanks,
Florian
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Nathan Sidwell-2
On 06/26/2018 07:01 AM, Florian Weimer wrote:

> On 06/26/2018 12:56 PM, Nathan Sidwell wrote:
>> On 06/26/2018 05:26 AM, Florian Weimer wrote:
>>
>>> So it looks to me that the caller of _Unwind_Find_FDE needs to ensure
>>> that the PC is a valid element of the call stack.  Is this a correct
>>> assumption?
>>
>> I thought this was an (implicit?) requirement. You're unwinding a
>> stack to deliver an exception up it.  Are there use cases where that
>> is not the case?
>
> We have something approaching this scenario.
>
> pthread_cancel in glibc unwinds the stack using DWARF information until
> encounters a frame without unwind information, when it switches to
> longjmp to get past that obstacle.

This is a long jump to the originating pthread function at the end of
the stack, right?  We not only get past the obstacle, but any and all
DWARF frames on top of it.  (just for my understanding)

> However, at the point of transition from a valid DWARF frame into the
> wilderness (without unwind data), we should still have accurate
> information on the caller's PC, so _Unwind_Find_FDE will reliably fail
> to find any unwind data for it.  It's not a random pointer somewhere
> else, so I think even the pthread_cancel case is fully supported.

That sounds right.  It's a PC that you could return to if you weren't
trying to unwind the stack.

nathan

--
Nathan Sidwell
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Florian Weimer-5
On 06/26/2018 01:35 PM, Nathan Sidwell wrote:

> On 06/26/2018 07:21 AM, Florian Weimer wrote:
>
>> GCC doesn't do this AFAIK, but it's theoretically possible not to
>> preserve the return address for a noreturn function.  But that would
>> be very bad for exception handling, so let's hope compilers don't do
>> this.
>
> I'd forgotten about noreturn.  Such functions may terminate by throwing
> an exception (and for our purposes I think pthread_cancel implementatio
> is sufficiently exception-like):
>
> from C++ std:  [dcl.attr.noreturn]/2 [ Note: The function may terminate
> by throwing an exception. — end note ]
>
> and from doc/extend.texi:
>
> The @code{noreturn} keyword does not affect the exceptional path
> when that applies: a @code{noreturn}-marked function may still
> return to the caller by throwing an exception or calling
> @code{longjmp}.
>
> IIRC, in gcc-land you have to give both noreturn and nothrow attributes
> to make it non-unwindable.

Are you sure?  I was under the impression that GCC did not do this
because it interferes too much with debugging.

Furthermore, glibc marks abort as nothrow and noreturn, which is a bit
dubious, considering that it is perfectly fine to throw exception from
synchronously delivered signals.

Thanks,
Florian
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Jakub Jelinek
On Tue, Jun 26, 2018 at 01:39:18PM +0200, Florian Weimer wrote:

> > The @code{noreturn} keyword does not affect the exceptional path
> > when that applies: a @code{noreturn}-marked function may still
> > return to the caller by throwing an exception or calling
> > @code{longjmp}.
> >
> > IIRC, in gcc-land you have to give both noreturn and nothrow attributes
> > to make it non-unwindable.
>
> Are you sure?  I was under the impression that GCC did not do this because
> it interferes too much with debugging.
>
> Furthermore, glibc marks abort as nothrow and noreturn, which is a bit
> dubious, considering that it is perfectly fine to throw exception from
> synchronously delivered signals.

There is unwindable and unwindable.  The default on many target is
-fasynchronous-unwind-tables, where at every instruction .eh_frame contains
everything needed to unwind.  nothrow is only about .gcc_except_table if
there is some unwind region etc.  So you can still do _Unwind_Backtrace if
you have noexcept functions.
I don't see what is of interest on noreturn functions, either the calls
them, then you have unwind info as usual, or it can tailcall them (that is
what GCC usually doesn't do) and then it is like any other tail call, you
can't see the original caller, but you can see a caller of that caller (or
perhaps with a bunch of other frames missing), but still something
unwindable.

        Jakub
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Jeff Law
In reply to this post by Florian Weimer-5
On 06/28/2018 06:30 AM, Florian Weimer wrote:

> On 06/28/2018 04:16 AM, Jeff Law wrote:
>>> Previous discussions:
>>>
>>> https://gcc.gnu.org/ml/gcc/2013-05/msg00253.html
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744
>>> https://sourceware.org/ml/libc-alpha/2016-07/msg00613.html
>>>    (patch with a spread lock, still not async-signal-safe)
>
>> You might also want to look at RH BZ 1293594 which I think has pointers
>> back to an issue from 2008 :(
>
> Interesting.  That does suspiciously look like a concurrent dlclose.
> It's just that the crash handler crashes, after the application crash. I
> think this one is really NOTABUG, both technically and from user impact:
> we do not cause the crash, we just react poorly to the application
> triggering undefined behavior.
>
> In the bug, you mentioned this code fragment for x86-64:
>
> 42        unsigned char *pc = context->ra;
> 43        struct sigcontext *sc;
> 44        long new_cfa;
> 45
> 46        /* movq __NR_rt_sigreturn, %rax ; syscall  */
> 47        if (*(unsigned char *)(pc+0) == 0x48
> 48            && *(unsigned long *)(pc+1) == 0x050f0000000fc0c7)
>
> I'm not sure I agree that it is “dumb”, but I think it proves
> conclusively that you cannot feed random addresses to the unwinder. 8-)
I believe "dumb" is referring to the fact that we're already in a bit of
a weird state as evidenced by the NULL FDE.  Blindly trying to read the
contents of the PC that we couldn't map to an FDE is, IMHO, dumb.

One might even be able to argue in this day and age that we should have
suitable descriptors for everything.  If no suitable descriptor is found
then backtracing should stop.  Lack of suitable descriptors in any code
would be considered a bug in that scenario.

jeff
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Michael Matz
Hi,

On Thu, 28 Jun 2018, Jeff Law wrote:

> I believe "dumb" is referring to the fact that we're already in a bit of
> a weird state as evidenced by the NULL FDE.  Blindly trying to read the
> contents of the PC that we couldn't map to an FDE is, IMHO, dumb.
>
> One might even be able to argue in this day and age that we should have
> suitable descriptors for everything.  If no suitable descriptor is found
> then backtracing should stop.  Lack of suitable descriptors in any code
> would be considered a bug in that scenario.

I disagree.  ASM code often lacks unwind descriptors (now less than in the
past, but still).  My rule of thumb is always: no descriptor -> has to be
a framepointer-using routine with standard calling sequence.  (I.e.
declare the combination of no descriptor and no fp to be a bug).  Some of
the callee-saved register will temporarily be wrong but unwinding can
continue.


Ciao,
Michael.
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Jakub Jelinek
On Mon, Jul 02, 2018 at 05:48:32PM +0200, Michael Matz wrote:

> Hi,
>
> On Thu, 28 Jun 2018, Jeff Law wrote:
>
> > I believe "dumb" is referring to the fact that we're already in a bit of
> > a weird state as evidenced by the NULL FDE.  Blindly trying to read the
> > contents of the PC that we couldn't map to an FDE is, IMHO, dumb.
> >
> > One might even be able to argue in this day and age that we should have
> > suitable descriptors for everything.  If no suitable descriptor is found
> > then backtracing should stop.  Lack of suitable descriptors in any code
> > would be considered a bug in that scenario.
>
> I disagree.  ASM code often lacks unwind descriptors (now less than in the
> past, but still).  My rule of thumb is always: no descriptor -> has to be
> a framepointer-using routine with standard calling sequence.  (I.e.
> declare the combination of no descriptor and no fp to be a bug).  Some of
> the callee-saved register will temporarily be wrong but unwinding can
> continue.

Doesn't that clash with the x86-64 ABI which says what kind of FDE use by
default if none is found (essentially a standard leaf routine that doesn't
change sp, nor save any registers)?

        Jakub
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Michael Matz
Hi,

On Mon, 2 Jul 2018, Jakub Jelinek wrote:

> > I disagree.  ASM code often lacks unwind descriptors (now less than in
> > the past, but still).  My rule of thumb is always: no descriptor ->
> > has to be a framepointer-using routine with standard calling sequence.  
> > (I.e. declare the combination of no descriptor and no fp to be a bug).  
> > Some of the callee-saved register will temporarily be wrong but
> > unwinding can continue.
>
> Doesn't that clash with the x86-64 ABI which says what kind of FDE use
> by default if none is found (essentially a standard leaf routine that
> doesn't change sp, nor save any registers)?

There is no such language in the psABI, no (at least I haven't found
anything; you had me worried for a moment :) ).  But there's stronger one:
all functions through which unwinding is expected must provide CFI.  So,
yes, such code isn't strictly conforming.  But there we are, there's much
code that isn't and we still have to sensibly deal with it (if we can).  
IMHO making guesses is better than to stop unwinding.  And IMHO guessing
that it's FP-using is better than guessing that it's leaf, especially if
the PC in question was the result of a prior unwinding step (making it
clear that it certainly was _not_ leaf).

And then there are (toy?) compilers that don't emit CFI, but do use FPs
(totally psABI non-compliant, sure); IMHO we shouldn't pessimize them
unduly.  Yeah, it's all a bit wonky, but why make it harder for those?


Ciao,
Michael.
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Florian Weimer-5
On 07/02/2018 06:14 PM, Michael Matz wrote:
> There is no such language in the psABI, no (at least I haven't found
> anything; you had me worried for a moment :) ).  But there's stronger one:
> all functions through which unwinding is expected must provide CFI.  So,
> yes, such code isn't strictly conforming.  But there we are, there's much
> code that isn't and we still have to sensibly deal with it (if we can).
> IMHO making guesses is better than to stop unwinding.  And IMHO guessing
> that it's FP-using is better than guessing that it's leaf, especially if
> the PC in question was the result of a prior unwinding step (making it
> clear that it certainly was_not_  leaf).

Well, the previous frame could have been a signal handler frame, but I
see your point.

Anyway, I've proposed a BoF for these topics for the next Cauldron.

Thanks,
Florian