[Bug nptl/23844] New: pthread_rwlock_trywrlock results in hang

[Bug nptl/23844] New: pthread_rwlock_trywrlock results in hang

cvs-commit at gcc dot gnu.org
https://sourceware.org/bugzilla/show_bug.cgi?id=23844

            Bug ID: 23844
           Summary: pthread_rwlock_trywrlock results in hang
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: nptl
          Assignee: unassigned at sourceware dot org
          Reporter: mike at marxmeier dot com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

Created attachment 11373
  --> https://sourceware.org/bugzilla/attachment.cgi?id=11373&action=edit
Test case

After upgrading from glibc 2.23 to 2.26, we've been seeing what looks like
a hang inside pthread_rwlock calls in our application.

A pthread_rwlock_trywrlock is used as a "quick path" if the lock is
available. Otherwise pthread_rwlock_wrlock is used, along with additional
measurement information.

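   /* Quick path: try to take the write lock without blocking. */
   if((ecode = pthread_rwlock_trywrlock(&rw_lock[idx]))) {
      /* Lock is busy: fall back to the blocking call. */
      if(ecode == EBUSY)
         ecode = pthread_rwlock_wrlock(&rw_lock[idx]);
      if(ecode) {
         /* pthread functions return the error code and do not set errno,
            so use strerror (from <string.h>) instead of perror. */
         fprintf(stderr, "pthread_rwlock_wrlock: %s\n", strerror(ecode));
         exit(1);
      }
   }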

This should behave identically to pthread_rwlock_wrlock().
Instead, it results in a hang: the lock is not held, yet all
threads are blocked in pthread_rwlock calls.
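
The quick-path pattern above can be written as a self-contained helper.
This is only a sketch: the function name wrlock_fast_path and the
contention counter wr_contended are hypothetical stand-ins, since the
report does not show its measurement code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical measurement: how often the quick path found the lock busy.
   File-scope objects are zero-initialized. */
atomic_ulong wr_contended;

/* Take the write lock, trying the non-blocking call first.
   Returns 0 on success or the pthread error code on failure. */
int wrlock_fast_path(pthread_rwlock_t *lock)
{
    assert(lock != NULL);
    int ecode = pthread_rwlock_trywrlock(lock);
    if (ecode == EBUSY) {
        /* Lock is held by someone else: count it, then block. */
        atomic_fetch_add(&wr_contended, 1);
        ecode = pthread_rwlock_wrlock(lock);
    }
    return ecode;
}
```

The helper returns the error code instead of exiting, so the caller can
decide how to report a failure.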

The attached test case makes it easily reproducible.

The lock state looks like this:

 {__data = {__lock = 10, __nr_readers = 0, __readers_wakeup = 2,
  __writer_wakeup = 3, __nr_readers_queued = 0, __nr_writers_queued = 0,
  __writer = 0, __shared = 0, __rwelision = 0 '\000',
  __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0},
  __size = "\n\000\000\000\000\000\000\000\002\000\000\000\003", '\000'
<repeats 42 times>, __align = 10}

Threads are suspended in pthread_rwlock_wrlock or pthread_rwlock_rdlock:

  2    Thread 0x7fb876c92700 (LWP 28072) "rwl_g" 0x00007fb87705e585 in
pthread_rwlock_wrlock () from /lib64/libpthread.so.0
  3    Thread 0x7fb876491700 (LWP 28073) "rwl_g" 0x00007fb87705e12a in
pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  4    Thread 0x7fb875c90700 (LWP 28074) "rwl_g" 0x00007fb87705e585 in
pthread_rwlock_wrlock () from /lib64/libpthread.so.0
* 5    Thread 0x7fb87548f700 (LWP 28075) "rwl_g" 0x00007fb87705e63f in
pthread_rwlock_wrlock () from /lib64/libpthread.so.0

--
You are receiving this mail because:
You are on the CC list for the bug.

[Bug nptl/23844] pthread_rwlock_trywrlock results in hang

cvs-commit at gcc dot gnu.org
https://sourceware.org/bugzilla/show_bug.cgi?id=23844

Andreas Schwab <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unspecified                 |2.26


[Bug nptl/23844] pthread_rwlock_trywrlock results in hang

cvs-commit at gcc dot gnu.org
https://sourceware.org/bugzilla/show_bug.cgi?id=23844

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-11-01
                 CC|                            |carlos at redhat dot com,
                   |                            |triegel at redhat dot com
     Ever confirmed|0                           |1

--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
Thank you for the bug report and reduced test case.

At first I thought this might be "rwlock: Fix explicit hand-over (bug 21298)",
but we fixed that *in* 2.26. Still, I would like you to double-check that you
have that fix in your sources, e.g. commit
faf8c066df0d6bccb54bd74dd696eeb65e1b3bbc.

I looked at your test case and it seems entirely reasonable.

It's interesting that the effect of using trylock + EBUSY checking basically
ensures that the locking and sleeping on the futex happen in a very narrow band
of code.

Interestingly enough, it works under strace; ptrace seems to perturb the test
enough that the hang does not occur. Likewise, it sometimes works under gdb
(which also uses ptrace), but it fails maybe 1 out of 4 times. So it certainly
looks like a race.

With readers doing the same thing, trylock + EBUSY->lock, it succeeds. So
perhaps it's only the write side that has a defect.
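
For reference, the reader-side equivalent of the pattern (trylock, then
lock on EBUSY) could look like the sketch below. The function name
rdlock_fast_path is hypothetical and is not taken from the test case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Take a read lock, trying the non-blocking call first.
   Returns 0 on success or the pthread error code on failure. */
int rdlock_fast_path(pthread_rwlock_t *lock)
{
    assert(lock != NULL);
    int ecode = pthread_rwlock_tryrdlock(lock);
    if (ecode == EBUSY) {
        /* A writer holds (or is favoured for) the lock: block instead. */
        ecode = pthread_rwlock_rdlock(lock);
    }
    /* Any other error, such as EAGAIN, is passed through unchanged. */
    return ecode;
}
```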

In the failure mode I see 3 threads in the write lock, and 1 thread in the read
lock. I haven't seen anything different. All 4 threads have entered the kernel
via futex_wait and are stuck there waiting for some thread to wake them up,
and that will never happen.
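
One way to make such a hang visible from inside a test program is to bound
the wait, so that a stuck thread reports ETIMEDOUT instead of sleeping
forever. The sketch below assumes that approach. The names
wrlock_with_deadline, struct lock_probe and lock_probe_main are
hypothetical, and the extra calls change the timing, so (as with strace)
this may perturb the race.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Quick path first, then a blocking write lock bounded by timeout_ms.
   Returns 0, ETIMEDOUT, or another pthread error code. */
int wrlock_with_deadline(pthread_rwlock_t *lock, long timeout_ms)
{
    struct timespec deadline;
    assert(lock != NULL && timeout_ms >= 0);
    /* pthread_rwlock_timedwrlock takes an absolute CLOCK_REALTIME time. */
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0)
        return errno;
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    int ecode = pthread_rwlock_trywrlock(lock);
    if (ecode == EBUSY)
        ecode = pthread_rwlock_timedwrlock(lock, &deadline);
    return ecode;
}

/* Hypothetical probe: one thread attempts the lock and records the result. */
struct lock_probe {
    pthread_rwlock_t *lock;
    long timeout_ms;
    int result;
};

void *lock_probe_main(void *arg)
{
    struct lock_probe *p = arg;
    p->result = wrlock_with_deadline(p->lock, p->timeout_ms);
    if (p->result == 0)
        pthread_rwlock_unlock(p->lock);  /* release right after acquiring */
    return NULL;
}
```

A result of ETIMEDOUT while no thread holds the lock would match the state
described above.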

I don't immediately see what's wrong; I'd have to audit
__pthread_rwlock_wrlock_full and __pthread_rwlock_trywrlock before drawing
a conclusion.

I've asked Torvald Riegel to have a look; he's the primary author of the new
rwlock implementation.


[Bug nptl/23844] pthread_rwlock_trywrlock results in hang

cvs-commit at gcc dot gnu.org
https://sourceware.org/bugzilla/show_bug.cgi?id=23844

--- Comment #2 from Michael Marxmeier <mike at marxmeier dot com> ---
Some additional notes

Unfortunately, the problem is not limited to this particular use of
trywrlock. I can also reproduce the same issue for readers; it takes
somewhat longer to reproduce, but you end up in the same condition.
Also, a more conventional use of trywrlock (checking the availability of
multiple resources) results in the same hang.

As far as I can see, this affects all glibc versions in recent distributions,
e.g. Fedora 28/29, SUSE 15 or Ubuntu 18.10. glibc versions before 2.26 do not
seem to be affected, e.g. CentOS 7.5.
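
To check which glibc a process is actually running against, a program can
ask the library directly. The sketch below uses gnu_get_libc_version()
from <gnu/libc-version.h>; the helper name glibc_at_least is hypothetical,
and the code assumes a glibc-based system.

```c
#include <assert.h>
#include <gnu/libc-version.h>
#include <stdio.h>

/* Returns 1 if the running glibc is at least major.minor, 0 if it is
   older, and -1 if the version string cannot be parsed. */
int glibc_at_least(int major, int minor)
{
    int have_major = 0, have_minor = 0;
    const char *version = gnu_get_libc_version();  /* e.g. "2.26" */
    assert(version != NULL);
    if (sscanf(version, "%d.%d", &have_major, &have_minor) != 2)
        return -1;
    return have_major > major ||
           (have_major == major && have_minor >= minor);
}
```

A result of 1 from glibc_at_least(2, 26) places the process in the range
this report describes as affected. Distributions may carry their own
patches, so treat it as a hint only.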
