Re: Invalid program counters and unwinding

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Nathan Sidwell-2
On 06/26/2018 05:26 AM, Florian Weimer wrote:

> So it looks to me that the caller of _Unwind_Find_FDE needs to ensure
> that the PC is a valid element of the call stack.  Is this a correct
> assumption?

I thought this was an (implicit?) requirement. You're unwinding a stack
to deliver an exception up it.  Are there use cases where that is not
the case?

nathan

--
Nathan Sidwell
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Florian Weimer-5
On 06/26/2018 12:56 PM, Nathan Sidwell wrote:
> On 06/26/2018 05:26 AM, Florian Weimer wrote:
>
>> So it looks to me that the caller of _Unwind_Find_FDE needs to ensure
>> that the PC is a valid element of the call stack.  Is this a correct
>> assumption?
>
> I thought this was an (implicit?) requirement. You're unwinding a stack
> to deliver an exception up it.  Are there use cases where that is not
> the case?

We have something approaching this scenario.

pthread_cancel in glibc unwinds the stack using DWARF information until
encounters a frame without unwind information, when it switches to
longjmp to get past that obstacle.

However, at the point of transition from a valid DWARF frame into the
wilderness (without unwind data), we should still have accurate
information on the caller's PC, so _Unwind_Find_FDE will reliably fail
to find any unwind data for it.  It's not a random pointer somewhere
else, so I think even the pthread_cancel case is fully supported.

Thanks,
Florian
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Florian Weimer-5
On 06/26/2018 01:15 PM, Nathan Sidwell wrote:

> On 06/26/2018 07:01 AM, Florian Weimer wrote:
>> On 06/26/2018 12:56 PM, Nathan Sidwell wrote:
>>> On 06/26/2018 05:26 AM, Florian Weimer wrote:
>>>
>>>> So it looks to me that the caller of _Unwind_Find_FDE needs to
>>>> ensure that the PC is a valid element of the call stack.  Is this a
>>>> correct assumption?
>>>
>>> I thought this was an (implicit?) requirement. You're unwinding a
>>> stack to deliver an exception up it.  Are there use cases where that
>>> is not the case?
>>
>> We have something approaching this scenario.
>>
>> pthread_cancel in glibc unwinds the stack using DWARF information
>> until encounters a frame without unwind information, when it switches
>> to longjmp to get past that obstacle.
>
> This is a long jump to the originating pthread function at the end of
> the stack, right?  We not only get past the obstacle, but any and all
> DWARF frames on top of it.  (just for my understanding)

Essentially yes.  It can also be an intermediate jump buffer, used to to
support compilation in -fno-exceptions mode.  In that case, unwinding
tries to proceed from there, again with a valid PC.

> That sounds right.  It's a PC that you could return to if you weren't trying to unwind the stack.

GCC doesn't do this AFAIK, but it's theoretically possible not to
preserve the return address for a noreturn function.  But that would be
very bad for exception handling, so let's hope compilers don't do this.

Florian
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Jakub Jelinek
In reply to this post by Florian Weimer-5
On Tue, Jun 26, 2018 at 01:01:06PM +0200, Florian Weimer wrote:

> On 06/26/2018 12:56 PM, Nathan Sidwell wrote:
> > On 06/26/2018 05:26 AM, Florian Weimer wrote:
> >
> > > So it looks to me that the caller of _Unwind_Find_FDE needs to
> > > ensure that the PC is a valid element of the call stack.  Is this a
> > > correct assumption?
> >
> > I thought this was an (implicit?) requirement. You're unwinding a stack
> > to deliver an exception up it.  Are there use cases where that is not
> > the case?
>
> We have something approaching this scenario.
>
> pthread_cancel in glibc unwinds the stack using DWARF information until
> encounters a frame without unwind information, when it switches to longjmp
> to get past that obstacle.
>
> However, at the point of transition from a valid DWARF frame into the
> wilderness (without unwind data), we should still have accurate information
> on the caller's PC, so _Unwind_Find_FDE will reliably fail to find any
> unwind data for it.  It's not a random pointer somewhere else, so I think
> even the pthread_cancel case is fully supported.

The usual ways to get bogus PCs in the frames is:
1) stack corruption
2) setcontext/swapcontext with uninitialized or corrupted ucontext_t
3) bogus unwind info (compiler or linker etc. bug)

At least for unwinding, I think we don't and shouldn't care, we assume only
valid programs.  For cases like _Unwind_Backtrace when used to print info in
case of fatal signal or stack corruption, it is more questionable, but at
least the current implmentation doesn't care either.  There have been some
requests e.g. to use extremely slow safe accesses like syscalls from the
potential invalid memory, or mincore, or parsing of /proc/self/maps to make
it work even if everything is corrupted, but so far nothing thankfully made
it in.

        Jakub
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Florian Weimer-5
On 06/26/2018 01:25 PM, Jakub Jelinek wrote:

> On Tue, Jun 26, 2018 at 01:01:06PM +0200, Florian Weimer wrote:
>> On 06/26/2018 12:56 PM, Nathan Sidwell wrote:
>>> On 06/26/2018 05:26 AM, Florian Weimer wrote:
>>>
>>>> So it looks to me that the caller of _Unwind_Find_FDE needs to
>>>> ensure that the PC is a valid element of the call stack.  Is this a
>>>> correct assumption?
>>>
>>> I thought this was an (implicit?) requirement. You're unwinding a stack
>>> to deliver an exception up it.  Are there use cases where that is not
>>> the case?
>>
>> We have something approaching this scenario.
>>
>> pthread_cancel in glibc unwinds the stack using DWARF information until
>> encounters a frame without unwind information, when it switches to longjmp
>> to get past that obstacle.
>>
>> However, at the point of transition from a valid DWARF frame into the
>> wilderness (without unwind data), we should still have accurate information
>> on the caller's PC, so _Unwind_Find_FDE will reliably fail to find any
>> unwind data for it.  It's not a random pointer somewhere else, so I think
>> even the pthread_cancel case is fully supported.
>
> The usual ways to get bogus PCs in the frames is:
> 1) stack corruption
> 2) setcontext/swapcontext with uninitialized or corrupted ucontext_t
> 3) bogus unwind info (compiler or linker etc. bug)

But if that happens, all bets are off, and we could still get a crash
with the current implementation.

And any approach which does not inhibit concurrent dlclose will only
make things worse if there are such concurrent calls, which is perhaps
an unusual combination.

> At least for unwinding, I think we don't and shouldn't care, we assume only
> valid programs.  For cases like _Unwind_Backtrace when used to print info in
> case of fatal signal or stack corruption, it is more questionable, but at
> least the current implmentation doesn't care either.

At least glibc no longer tries to print a backtrace from a corrupted stack.

Thanks,
Florian
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Nathan Sidwell-2
In reply to this post by Florian Weimer-5
On 06/26/2018 07:21 AM, Florian Weimer wrote:

> GCC doesn't do this AFAIK, but it's theoretically possible not to
> preserve the return address for a noreturn function.  But that would be
> very bad for exception handling, so let's hope compilers don't do this.

I'd forgotten about noreturn.  Such functions may terminate by throwing
an exception (and for our purposes I think pthread_cancel implementatio
is sufficiently exception-like):

from C++ std:  [dcl.attr.noreturn]/2 [ Note: The function may terminate
by throwing an exception. — end note ]

and from doc/extend.texi:

The @code{noreturn} keyword does not affect the exceptional path
when that applies: a @code{noreturn}-marked function may still
return to the caller by throwing an exception or calling
@code{longjmp}.

IIRC, in gcc-land you have to give both noreturn and nothrow attributes
to make it non-unwindable.

nathan
--
Nathan Sidwell
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Nathan Sidwell-2
On 06/26/2018 07:39 AM, Florian Weimer wrote:
> On 06/26/2018 01:35 PM, Nathan Sidwell wrote:

>> IIRC, in gcc-land you have to give both noreturn and nothrow
>> attributes to make it non-unwindable.
>
> Are you sure?  I was under the impression that GCC did not do this
> because it interferes too much with debugging.

I'm not sure at all.   I remember needing to use nothrow somewhere that
throw() didn't cut it, and could well have got the details wrong, and
the compiler has moved on anyway.

> Furthermore, glibc marks abort as nothrow and noreturn, which is a bit
> dubious, considering that it is perfectly fine to throw exception from
> synchronously delivered signals.

Hm, that seems contradictory ...

nathan

--
Nathan Sidwell
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

John Reiser
In reply to this post by Jakub Jelinek
On 06/26/2018, Jakub Jelinek wrote:

> The usual ways to get bogus PCs in the frames is:
> 1) stack corruption
> 2) setcontext/swapcontext with uninitialized or corrupted ucontext_t
> 3) bogus unwind info (compiler or linker etc. bug)
>
> At least for unwinding, I think we don't and shouldn't care, we assume only
> valid programs.

That assumption is not reliable in practice.  About once per year I find
a totally bogus PC value while unwinding, such as:

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66874
     RFE: x86_64_fallback_frame_state more robust

   https://sourceware.org/bugzilla/show_bug.cgi?id=18635
     stdlib/tst-makecontext fails on ix86

At least *some* run-time effort should be made to protect against SIGSEGV
when a purported PC is 0, or otherwise obviously bogus.
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Jeff Law
In reply to this post by Nathan Sidwell-2
On 06/26/2018 03:26 AM, Florian Weimer wrote:

> I'm looking at ways to speed up _Unwind_Find_FDE when libgcc is running
> on top of glibc.  I have something (at the design level, with some of
> the code written) which allows me to get a pointer to the
> PT_GNU_EH_FRAME segment in memory in a lock-free fashion (so it would
> also be async-signal safe).
>
> This part works also when the program counter used in the search is
> invalid and does not point to within a loaded object, even in the case
> of concurrent dlopen/dlclose.
>
> However, it's still necessary to read the PT_GNU_EH_FRAME data itself,
> and if _Unwind_Find_FDE is not a valid program counter found on the
> stack (with in a caller, where unmapping it with dlclose would be
> invalid), it could happen that it is a random address in *another*,
> unrelated object, which then gets dlclose'd (which is valid).
>
> The current glibc-based implementation in libgcc calls dl_iterate_phdr,
> which acquires a lock blocking dlclose for the entire duration of the
> iteration.  But I think this still doesn't support arbitrary, random PC
> values because in the worst case, the PC value looks valid, we find some
> unrelated FDE data with an associated personality routine, and end up
> calling that, with disastrous consequences.
>
> So it looks to me that the caller of _Unwind_Find_FDE needs to ensure
> that the PC is a valid element of the call stack.  Is this a correct
> assumption?
>
> I have some ideas how make reading the PT_GNU_EH_FRAME data safe, but
> the question is whether we actually need that.
>
> Previous discussions:
>
> https://gcc.gnu.org/ml/gcc/2013-05/msg00253.html
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744
> https://sourceware.org/ml/libc-alpha/2016-07/msg00613.html
>   (patch with a spread lock, still not async-signal-safe)
You might also want to look at RH BZ 1293594 which I think has pointers
back to an issue from 2008 :(

Jeff
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Florian Weimer-5
On 06/28/2018 04:16 AM, Jeff Law wrote:
>> Previous discussions:
>>
>> https://gcc.gnu.org/ml/gcc/2013-05/msg00253.html
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744
>> https://sourceware.org/ml/libc-alpha/2016-07/msg00613.html
>>    (patch with a spread lock, still not async-signal-safe)

> You might also want to look at RH BZ 1293594 which I think has pointers
> back to an issue from 2008 :(

Interesting.  That does suspiciously look like a concurrent dlclose.
It's just that the crash handler crashes, after the application crash.
I think this one is really NOTABUG, both technically and from user
impact: we do not cause the crash, we just react poorly to the
application triggering undefined behavior.

In the bug, you mentioned this code fragment for x86-64:

42        unsigned char *pc = context->ra;
43        struct sigcontext *sc;
44        long new_cfa;
45
46        /* movq __NR_rt_sigreturn, %rax ; syscall  */
47        if (*(unsigned char *)(pc+0) == 0x48
48            && *(unsigned long *)(pc+1) == 0x050f0000000fc0c7)

I'm not sure I agree that it is “dumb”, but I think it proves
conclusively that you cannot feed random addresses to the unwinder. 8-)

Thanks,
Florian
Reply | Threaded
Open this post in threaded view
|

Re: Invalid program counters and unwinding

Florian Weimer-5
On 06/28/2018 04:18 PM, Jeff Law wrote:

> On 06/28/2018 06:30 AM, Florian Weimer wrote:
>> On 06/28/2018 04:16 AM, Jeff Law wrote:
>>>> Previous discussions:
>>>>
>>>> https://gcc.gnu.org/ml/gcc/2013-05/msg00253.html
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744
>>>> https://sourceware.org/ml/libc-alpha/2016-07/msg00613.html
>>>>     (patch with a spread lock, still not async-signal-safe)
>>
>>> You might also want to look at RH BZ 1293594 which I think has pointers
>>> back to an issue from 2008 :(
>>
>> Interesting.  That does suspiciously look like a concurrent dlclose.
>> It's just that the crash handler crashes, after the application crash. I
>> think this one is really NOTABUG, both technically and from user impact:
>> we do not cause the crash, we just react poorly to the application
>> triggering undefined behavior.
>>
>> In the bug, you mentioned this code fragment for x86-64:
>>
>> 42        unsigned char *pc = context->ra;
>> 43        struct sigcontext *sc;
>> 44        long new_cfa;
>> 45
>> 46        /* movq __NR_rt_sigreturn, %rax ; syscall  */
>> 47        if (*(unsigned char *)(pc+0) == 0x48
>> 48            && *(unsigned long *)(pc+1) == 0x050f0000000fc0c7)
>>
>> I'm not sure I agree that it is “dumb”, but I think it proves
>> conclusively that you cannot feed random addresses to the unwinder. 8-)

> I believe "dumb" is referring to the fact that we're already in a bit of
> a weird state as evidenced by the NULL FDE.  Blindly trying to read the
> contents of the PC that we couldn't map to an FDE is, IMHO, dumb.

I think the code derives from i386, where historically it was necessary
to unwind with the frame pointer only.  I suspect we still need this
code there at least, to support legacy binaries.

For x86-64, we should probably not to attempt without tables at all, but
I have no idea whether that's feasible.

Thanks,
Florian