[Bug linuxthreads/25774] New: Suggestion: count number of retries in __lll_lock_wait

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug linuxthreads/25774] New: Suggestion: count number of retries in __lll_lock_wait

Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25774

            Bug ID: 25774
           Summary: Suggestion: count number of retries in __lll_lock_wait
           Product: glibc
           Version: 2.30
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P2
         Component: linuxthreads
          Assignee: unassigned at sourceware dot org
          Reporter: rigault.francois+glib at gmail dot com
                CC: drow at false dot org
  Target Milestone: ---

There is no guarantee that threads trying to concurrently access the same
mutex, acquire it in a fair way.

If you consider a c++ program running multiple of these threads:


```
#include <mutex>
#include <chrono>
#include <iostream>
#include <thread>
using namespace std;
using namespace std::chrono;

mutex m;

void func(string s) {
        while(true) {
                lock_guard<mutex> l(m);
                cout << s << endl;
                this_thread::sleep_for(seconds(1));
        }
}
int main() {

        thread one(func, "one");
        thread two(func, "two");

        one.join();
        return 0;
}
```

here the first thread acquiring the mutex will keep it for one second.
When it releases the mutex, it wakes up the other thread.

However, practically, there is no guarantee that the other thread successfully
acquires it. Both threads are still trying to concurrently acquire the mutex.
In practice, the same thread gets to acquire the mutex again and again. The
second thread will seem stuck and will never acquire the mutex.

In order to ease the investigation of these kinds of stuck threads, what do you
think of improving the current stap probe point in this fashion:

```
--- glibc-2.30/nptl/lowlevellock.c      2019-08-01 06:29:31.000000000 +0200
+++ ../glibc/glibc-2.30/nptl/lowlevellock.c     2020-04-04 11:24:50.367896172
+0200
@@ -42,13 +42,16 @@
 void
 __lll_lock_wait (int *futex, int private)
 {
+  int count = 0;
+
   if (atomic_load_relaxed (futex) == 2)
     goto futex;

   while (atomic_exchange_acquire (futex, 2) != 0)
     {
+      ++count;
     futex:
-      LIBC_PROBE (lll_lock_wait, 1, futex);
+      LIBC_PROBE (lll_lock_wait, 2, futex, count);
       lll_futex_wait (futex, 2, private); /* Wait if *futex == 2.  */
     }
 }
```

this would allow a system tap script to detect how many times a stuck thread
has been woken up without managing to acquire the lock.

It seems pretty easy to fall into this caveat when developing a multi threaded
application, I don't know if there is any other mean to investigate this.
Feedback is very welcome.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug linuxthreads/25774] Suggestion: count number of retries in __lll_lock_wait

Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25774

Romain Geissler <romain.geissler at amadeus dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |romain.geissler at amadeus dot com

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug linuxthreads/25774] Suggestion: count number of retries in __lll_lock_wait

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25774

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
                 CC|                            |carlos at redhat dot com
   Last reconfirmed|                            |2020-04-04
     Ever confirmed|0                           |1

--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to François Rigault from comment #0)

> ```
> --- glibc-2.30/nptl/lowlevellock.c      2019-08-01 06:29:31.000000000 +0200
> +++ ../glibc/glibc-2.30/nptl/lowlevellock.c     2020-04-04
> 11:24:50.367896172 +0200
> @@ -42,13 +42,16 @@
>  void
>  __lll_lock_wait (int *futex, int private)
>  {
> +  int count = 0;
> +
>    if (atomic_load_relaxed (futex) == 2)
>      goto futex;
>
>    while (atomic_exchange_acquire (futex, 2) != 0)
>      {
> +      ++count;
>      futex:
> -      LIBC_PROBE (lll_lock_wait, 1, futex);
> +      LIBC_PROBE (lll_lock_wait, 2, futex, count);
>        lll_futex_wait (futex, 2, private); /* Wait if *futex == 2.  */
>      }
>  }
> ```

It adds a small amount of code in the slow path before the wait, which seems
fine to me, we're about to enter the kernel and sleep.

The value is local, and should be 'unsigned' to allow wrapping.

I think your suggestion is a good one.

Would you like to put a patch together for this?

I'd expect:
- The change to the probe.
- Add the probe to manual/probes.texi under a new node "POSIX Thread Probes"
and document the behaviour (previously undocumented) of just this probe.

Have a look at our contribution checklist:
https://sourceware.org/glibc/wiki/Contribution%20checklist

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug linuxthreads/25774] Suggestion: count number of retries in __lll_lock_wait

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25774

--- Comment #2 from François Rigault <rigault.francois+glib at gmail dot com> ---
Thanks for the feedback! I will try to read all that.

In the meantime, I was scratching my head to make this work with an older glibc
2.23. It turns out that in glibc 2.23, there was already a second argument to
the probe, that used to dump the "futex_op" argument of the futex syscall

```
1:     LIBC_PROBE (lll_lock_wait, 2, %rdi, %rsi)
```

so, depending on the version of glibc you will find either a probe with a
single argument, either a probe which second argument is either futex_op, or a
counter of how many times the thread was woken up uselessly. Hope it's not a
big deal.

Attaching what it looks like in glibc 2.23, purely fyi

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug linuxthreads/25774] Suggestion: count number of retries in __lll_lock_wait

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25774

--- Comment #3 from François Rigault <rigault.francois+glib at gmail dot com> ---
Created attachment 12433
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12433&action=edit
fyi changing the probe in glibc 2.23

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug nptl/25774] Suggestion: count number of retries in __lll_lock_wait

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25774

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-
                 CC|                            |drepper.fsp at gmail dot com,
                   |                            |fweimer at redhat dot com
          Component|linuxthreads                |nptl

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug nptl/25774] Suggestion: count number of retries in __lll_lock_wait

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25774

--- Comment #4 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to François Rigault from comment #3)
> Created attachment 12433 [details]
> fyi changing the probe in glibc 2.23

There is no lowlevellock.S anymore in upstream. Please work on master.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug nptl/25774] Suggestion: count number of retries in __lll_lock_wait

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25774

--- Comment #5 from François Rigault <rigault.francois+glib at gmail dot com> ---
Created attachment 12451
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12451&action=edit
Proposed patch

Proposed patch. Can you please help me review it?

This is a tiny change that should not be legally significant as I understand.

--
You are receiving this mail because:
You are on the CC list for the bug.