[Bug linuxthreads/3597] New: Possible race condition in pthread_exit() function resulting in core dump.

[Bug linuxthreads/3597] New: Possible race condition in pthread_exit() function resulting in core dump.

tim@mr-dog.net
Here at MySQL we got a core dump inside the pthread_exit() function. One of our
developers did the analysis (quoted below), which shows that the problem might be
related to concurrent execution of pthread_exit() code and aggressive
optimizations made by current compilers.

Suggested fix: declare the libgcc_s_getcfa and/or libgcc_s_forcedunwind variables
volatile to prevent them from being kept in processor registers, or disable
context switches during pthread_exit() execution.

The analysis:

Here is the stacktrace from the core:

    (gdb) bt
    #0  0x005227a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
    #1  0x007898bb in pthread_kill () from /lib/tls/libpthread.so.0
    #2  0x084a1775 in write_core (sig=11) at stacktrace.c:245
    #3  0x0826c995 in handle_segfault (sig=11) at mysqld.cc:2115
    #4  <signal handler called>
    #5  0x00000000 in ?? ()
    #6  0x0078d2aa in _Unwind_ForcedUnwind () from /lib/tls/libpthread.so.0
    #7  0x0078af81 in __pthread_unwind () from /lib/tls/libpthread.so.0
    #8  0x00786f00 in pthread_exit () from /lib/tls/libpthread.so.0
    #9  0x084885d7 in handle_slave_io (arg=0x9f6c858) at slave.cc:3769
    #10 0x00786341 in start_thread () from /lib/tls/libpthread.so.0
    #11 0x006066fe in clone () from /lib/tls/libc.so.6

I have a plausible theory about what is going on.

The crash is in this piece of code from glibc (as found by Google code
search), where pthread_cancel_init() is inlined in _Unwind_ForcedUnwind():

    _Unwind_Reason_Code
    _Unwind_ForcedUnwind (struct _Unwind_Exception *exc, _Unwind_Stop_Fn stop,
                          void *stop_argument)
    {
      if (__builtin_expect (libgcc_s_forcedunwind == NULL, 0))
        pthread_cancel_init ();
      return libgcc_s_forcedunwind (exc, stop, stop_argument);
    }

    void
    pthread_cancel_init (void)
    {
      void *resume, *personality, *forcedunwind, *getcfa;
      void *handle;

      if (__builtin_expect (libgcc_s_getcfa != NULL, 1))
        return;

      handle = __libc_dlopen ("libgcc_s.so.1");

      if (handle == NULL
          || (resume = __libc_dlsym (handle, "_Unwind_Resume")) == NULL
          || (personality = __libc_dlsym (handle, "__gcc_personality_v0")) == NULL
          || (forcedunwind = __libc_dlsym (handle, "_Unwind_ForcedUnwind"))
             == NULL
          || (getcfa = __libc_dlsym (handle, "_Unwind_GetCFA")) == NULL
    #ifdef ARCH_CANCEL_INIT
          || ARCH_CANCEL_INIT (handle)
    #endif
          )
        __libc_fatal ("libgcc_s.so.1 must be installed for pthread_cancel to work\n");

      libgcc_s_resume = resume;
      libgcc_s_personality = personality;
      libgcc_s_forcedunwind = forcedunwind;
      libgcc_s_getcfa = getcfa;
    }

Note that there is actually a race in this code:

   - Thread A finds libgcc_s_forcedunwind==NULL and enters
     pthread_cancel_init(). A context switch then occurs before thread A has
     the time to check the libgcc_s_getcfa variable.

   - Thread B finds libgcc_s_forcedunwind==NULL and enters
     pthread_cancel_init(). It finds libgcc_s_getcfa==NULL, and goes to set
     libgcc_s_getcfa = getcfa.

   - Thread A is later re-scheduled, and now finds libgcc_s_getcfa!=NULL so
     returns immediately from pthread_cancel_init(). It then proceeds to
     execute the call (*libgcc_s_forcedunwind)() using the _old_ previously
     loaded value in %edx, which is still NULL. Hence a segfault.

So the problem is that the libgcc_s_getcfa variable is checked and modified
without any kind of synchronization.
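
For illustration only, here is a minimal sketch of how a check-then-set
initialization like the one above could be made safe with pthread_once().
The names unwind_fn, real_unwind, init_unwinder and call_unwinder are
hypothetical stand-ins, not glibc identifiers, and this is not claimed to be
the change that was committed to glibc.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the function pointer resolved from libgcc_s. */
typedef int (*unwind_fn)(int);

/* Placeholder for the symbol that dlsym() would return. */
static int real_unwind(int x) { return x + 1; }

static unwind_fn unwinder;  /* written only inside init_unwinder() */
static pthread_once_t unwinder_once = PTHREAD_ONCE_INIT;

/* One-time initializer: the only place that writes 'unwinder'. */
static void init_unwinder(void)
{
    unwinder = real_unwind;
}

int call_unwinder(int x)
{
    /* pthread_once() runs init_unwinder exactly once. Any other caller
       arriving in the meantime does not return from pthread_once() until
       the initializer has completed. */
    pthread_once(&unwinder_once, init_unwinder);

    /* The pointer is read only after pthread_once() has returned, so no
       value loaded before initialization is reused here. */
    assert(unwinder != NULL);
    return unwinder(x);
}
```

With this shape no thread can observe one variable set while the other is
still NULL, because nothing is read before the one-time initializer has
finished.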

I actually found some evidence in the core file that this race is exactly what
happened. Here are the registers at the point of crash:

    (gdb) info reg
    eax            0x78ade0 7908832
    ecx            0xb6fa6dd0       -1225101872
    edx            0x0      0
    ebx            0x78fff4 7929844
    esp            0xb6fa6328       0xb6fa6328
    ebp            0xb6fa6348       0xb6fa6348
    esi            0xb6fa6480       -1225104256
    edi            0xb6fa6dd0       -1225101872
    eip            0x78d2aa 0x78d2aa

And here is the disassembly, with some comments.

    0x0078d270 <_Unwind_ForcedUnwind+0>:    push   %ebp
    0x0078d271 <_Unwind_ForcedUnwind+1>:    mov    %esp,%ebp
    0x0078d273 <_Unwind_ForcedUnwind+3>:    sub    $0x20,%esp
    0x0078d276 <_Unwind_ForcedUnwind+6>:    mov    %ebx,0xfffffff4(%ebp)
    0x0078d279 <_Unwind_ForcedUnwind+9>:    call   0x7852da <__i686.get_pc_thunk.bx>
    0x0078d27e <_Unwind_ForcedUnwind+14>:   add    $0x2d76,%ebx
    0x0078d284 <_Unwind_ForcedUnwind+20>:   mov    %esi,0xfffffff8(%ebp)
    0x0078d287 <_Unwind_ForcedUnwind+23>:   mov    0x21ac(%ebx),%edx

libgcc_s_forcedunwind is now loaded in %edx.

    0x0078d28d <_Unwind_ForcedUnwind+29>:   mov    %edi,0xfffffffc(%ebp)
    0x0078d290 <_Unwind_ForcedUnwind+32>:   test   %edx,%edx
    0x0078d292 <_Unwind_ForcedUnwind+34>:   je     0x78d2b7 <_Unwind_ForcedUnwind+71>

From the register dump, %edx is 0, so this jump is taken.

    0x0078d294 <_Unwind_ForcedUnwind+36>:   mov    0x10(%ebp),%esi
    0x0078d297 <_Unwind_ForcedUnwind+39>:   mov    0x8(%ebp),%edi
    0x0078d29a <_Unwind_ForcedUnwind+42>:   mov    0xc(%ebp),%eax
    0x0078d29d <_Unwind_ForcedUnwind+45>:   mov    %esi,0x8(%esp)
    0x0078d2a1 <_Unwind_ForcedUnwind+49>:   mov    %edi,(%esp)
    0x0078d2a4 <_Unwind_ForcedUnwind+52>:   mov    %eax,0x4(%esp)
    0x0078d2a8 <_Unwind_ForcedUnwind+56>:   call   *%edx

This is where it crashes since %edx (libgcc_s_forcedunwind) is NULL.

    0x0078d2aa <_Unwind_ForcedUnwind+58>:   mov    0xfffffff4(%ebp),%ebx
    0x0078d2ad <_Unwind_ForcedUnwind+61>:   mov    0xfffffff8(%ebp),%esi
    0x0078d2b0 <_Unwind_ForcedUnwind+64>:   mov    0xfffffffc(%ebp),%edi
    0x0078d2b3 <_Unwind_ForcedUnwind+67>:   mov    %ebp,%esp
    0x0078d2b5 <_Unwind_ForcedUnwind+69>:   pop    %ebp
    0x0078d2b6 <_Unwind_ForcedUnwind+70>:   ret    

This is where the code jumps to from above when it finds libgcc_s_forcedunwind
to be NULL (it is the inlined pthread_cancel_init() code).

    0x0078d2b7 <_Unwind_ForcedUnwind+71>:   mov    0x21b0(%ebx),%eax
    0x0078d2bd <_Unwind_ForcedUnwind+77>:   test   %eax,%eax
    0x0078d2bf <_Unwind_ForcedUnwind+79>:   jne    0x78d294 <_Unwind_ForcedUnwind+36>

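And here it returns immediately: it loads libgcc_s_getcfa into %eax, finds it
non-NULL, and jumps back to +36 with the stale NULL still in %edx -> crash.
(The 0x78ade0 shown for %eax in the register dump is not this value, since
%eax is overwritten by the load at +42 before the call.)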

Some more dumps to show this:

    (gdb) x $ebx+0x21ac
    0x7921a0 <libgcc_s_forcedunwind>:       0x009340e4
    (gdb) x $ebx+0x21b0
    0x7921a4 <libgcc_s_getcfa>:     0x00932b98
    (gdb) x 0x009340e4
    0x9340e4 <_Unwind_ForcedUnwind>:        0x57e58955
    (gdb) x 0x00932b98
    0x932b98 <_Unwind_GetCFA>:      0x8be58955

So the variable libgcc_s_forcedunwind is actually non-NULL at the time of
crash (set by the other thread in the race). But the compiled code naturally
uses the previously loaded value in %edx, having no reason to believe that it
might have changed since it was last loaded. Hence the crash.
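
As a rough sketch of the remedy for the stale-register part of the problem,
the pointer can be loaded again after the initialization path instead of
reusing the value read before it. The version below uses C11 atomics with
release/acquire ordering. All names (unwind_fn, real_unwind, forced_unwind,
cancel_init, forced_unwind_call) are hypothetical, and this is an assumption
about one possible approach, not the actual glibc patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the pointer type resolved from libgcc_s. */
typedef int (*unwind_fn)(int);

/* Placeholder for the real unwinder entry point. */
static int real_unwind(int x) { return x * 2; }

/* Models libgcc_s_forcedunwind: NULL until initialized. */
static _Atomic(unwind_fn) forced_unwind;

/* Idempotent initializer: every racing thread stores the same value, so
   running it more than once is harmless in this simplified model. */
static void cancel_init(void)
{
    atomic_store_explicit(&forced_unwind, &real_unwind,
                          memory_order_release);
}

int forced_unwind_call(int x)
{
    unwind_fn fn = atomic_load_explicit(&forced_unwind,
                                        memory_order_acquire);
    if (fn == NULL) {
        cancel_init();
        /* Load again: the value read before cancel_init() is stale by
           construction, which is exactly what bit the code above. */
        fn = atomic_load_explicit(&forced_unwind, memory_order_acquire);
    }
    assert(fn != NULL);
    return fn(x);
}
```

The atomic accesses also stop the compiler from assuming the variable cannot
change between the two reads, which is the assumption that kept NULL in %edx.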

  $ ./configure --prefix=/usr/local/mysql --enable-assembler \
      --with-extra-charsets=complex --enable-thread-safe-client \
      --with-readline --with-big-tables --with-debug --disable-shared \
      --with-innodb --with-berkeley-db --with-ndbcluster \
      --with-archive-storage-engine --with-big-tables \
      --with-blackhole-storage-engine --with-federated-storage-engine \
      --with-csv-storage-engine --with-yassl --with-embedded-server \
      --enable-local-infile

Some build info from config.log follows:


## --------- ##
## Platform. ##
## --------- ##

hostname = <cut>
uname -m = i686
uname -r = 2.6.9-22.0.1.ELsmp
uname -s = Linux
uname -v = #1 SMP Tue Oct 18 18:39:27 EDT 2005

/usr/bin/uname -p = unknown
/bin/uname -X     = unknown

/bin/arch              = i686
/usr/bin/arch -k       = unknown
/usr/convex/getsysinfo = unknown
hostinfo               = unknown
/bin/machine           = unknown
/usr/bin/oslevel       = unknown
/bin/universe          = unknown

--
           Summary: Possible race condition in pthread_exit() function
                    resulting in core dump.
           Product: glibc
           Version: 2.3.4
            Status: NEW
          Severity: normal
          Priority: P2
         Component: linuxthreads
        AssignedTo: drow at false dot org
        ReportedBy: rsomla at mysql dot com
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=3597

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

[Bug linuxthreads/3597] Possible race condition in pthread_exit() function resulting in core dump.

tim@mr-dog.net

------- Additional Comments From drow at false dot org  2006-11-27 18:03 -------
Subject: Re:  New: Possible race condition in pthread_exit() function resulting in core dump.

On Mon, Nov 27, 2006 at 05:48:42PM -0000, rsomla at mysql dot com wrote:
> Suggested fix: declare libgcc_s_getcfa and/or libgcc_s_forcedunwind variables
> volatile to prevent keeping them in processor registers or disable context
> switches during pthread_exit() execution.

This is NPTL, not LinuxThreads - anything in the "tls" subdirectory is
NPTL.

I imagine that you have found the bug which was described here:
  http://sourceware.org/bugzilla/show_bug.cgi?id=2644

Several patches were committed to fix this.  I recommend you obtain a
more recent version of glibc to test.




[Bug nptl/3597] Possible race condition in pthread_exit() function resulting in core dump.

tim@mr-dog.net


--
           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|drow at false dot org       |drepper at redhat dot com
          Component|linuxthreads                |nptl



[Bug nptl/3597] Possible race condition in pthread_exit() function resulting in core dump.

tim@mr-dog.net

------- Additional Comments From rsomla at mysql dot com  2006-11-27 18:10 -------
Subject: Re:  Possible race condition in pthread_exit()
 function resulting in core dump.

drow at false dot org wrote:
> I imagine that you have found the bug which was described here:
>   http://sourceware.org/bugzilla/show_bug.cgi?id=2644

No, I hadn't seen that report - sorry about that. It looks exactly like what I was reporting.

>
> Several patches were committed to fix this.  I recommend you obtain a
> more recent version of glibc to test.

Thanks for pointing that out!

Cheers,

Rafal



[Bug nptl/3597] Possible race condition in pthread_exit() function resulting in core dump.

tim@mr-dog.net

------- Additional Comments From jakub at redhat dot com  2006-11-28 10:31 -------


*** This bug has been marked as a duplicate of 2644 ***

--
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE

