[PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

[PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

katsuki.uwatoko
Hi,

I have found an issue in __lll_timedlock_wait on ARM.

The following sequence causes an unnecessary busy loop:

"A thread" gets the lock. (futex = 1)
"B thread" tries to get the lock, and has not called futex syscall yet. (futex = 2)
"A thread" releases the lock (futex = 0)
"C thread" gets the lock, and does not call futex syscall because of futex=0 (futex = 1)
"B thread" calls the futex syscall, and it returns with an error,
           because futex_wait_setup() in the Linux kernel checks that
           the value is still 2 (the third argument of the syscall),
           but in this case the futex is 1.
"B thread" tries to get the lock in userspace but cannot get it
           because the futex is not 0.
           In the end the thread keeps calling the futex syscall
           until "C thread" releases it (futex = 0) or the timeout expires.

Therefore I think that the value should be set to 2 on every loop
iteration, the same as in __lll_lock_wait_private. A patch for this
issue is attached.
--
UWATOKO Katsuki

---
 .../unix/sysv/linux/arm/nptl/lowlevellock.c        |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c b/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c
index 756f39f..a495d85 100644
--- a/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c
+++ b/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c
@@ -65,6 +65,7 @@ __lll_timedlock_wait (int *futex, const struct timespec *abstime, int private)
   do
     {
       struct timeval tv;
+      int oldval;
 
       /* Get the current time.  */
       (void) __gettimeofday (&tv, NULL);
@@ -82,8 +83,10 @@ __lll_timedlock_wait (int *futex, const struct timespec *abstime, int private)
       if (rt.tv_sec < 0)
 	return ETIMEDOUT;
 
+      oldval = atomic_compare_and_exchange_val_acq (futex, 2, 1);
       // XYZ: Lost the lock to check whether it was private.
-      lll_futex_timed_wait (futex, 2, &rt, private);
+      if (oldval != 0)
+	lll_futex_timed_wait (futex, 2, &rt, private);
     }
   while (atomic_compare_and_exchange_bool_acq (futex, 2, 0) != 0);
 
--
1.7.4.1


Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

Joseph Myers
On Thu, 31 Jan 2013, [hidden email] wrote:

> Hi,
>
> I have found an issue in __lll_timedlock_wait on ARM.

I think the busy loop should have a bug filed in Bugzilla, as a
user-visible bug - could you file that bug?

> The following sequence causes unnecessary busy loop.
>
> "A thread" gets the lock. (futex = 1)
> "B thread" tries to get the lock, and has not called futex syscall yet. (futex = 2)
> "A thread" releases the lock (futex = 0)
> "C thread" gets the lock, and does not call futex syscall because of futex=0 (futex = 1)
> "B thread" calls futex syscall, and returns with an error.
>            Because futex syscall in Linux Kernel checks if the val is changed
>            from 2, which is the 3rd arg of the syscall at futex_wait_setup(),
>            but in this case futex is 1.
> "B thread" tries to get the lock in userspace but cannot get it
>            because futex is not 0.
>            After all the thread keeps calling futex syscall
>            until "C thread" will release it (futex = 0) or timeout.
>
> Therefore I think that the value should be set 2 in every loop
> the same as __lll_lock_wait_private, and attached a patch for this issue.

Carlos, any comments on this patch
<http://sourceware.org/ml/libc-ports/2013-01/msg00084.html>?  It makes the
ARM version of __lll_timedlock_wait closer to the HPPA version, but I
don't know if many of the differences between different architecture
versions of this code are really deliberate....

Would you agree that the generic Linux version
(nptl/sysdeps/unix/sysv/linux/lowlevellock.c) doesn't need such a change
because the loop is using atomic_exchange_acq rather than
atomic_compare_and_exchange_bool_acq, so is already setting the futex to 2
in every loop iteration?

--
Joseph S. Myers
[hidden email]

Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

katsuki.uwatoko

> > I have found an issue in __lll_timedlock_wait on ARM.
>
> I think the busy loop should have a bug filed in Bugzilla, as a
> user-visible bug - could you file that bug?

Thank you for the reply. I have filed this in Bugzilla.

http://sourceware.org/bugzilla/show_bug.cgi?id=15119
--
UWATOKO Katsuki

Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

Carlos O'Donell-2
In reply to this post by Joseph Myers
Dan,

If you wouldn't mind I'd like your historical perspective on a glibc issue.

You were the person who last touched
ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c; do you have any
recollection why ARM needed a custom copy of this file?

Dave,

I see that sparc32 also has a unique copy of lowlevellock.c. Why the
use of *_24_* atomic primitives? Faster?

Andreas, Thomas, Kaz, Marcus,

I've added you to the TO: because machines you maintain are affected
by a bug in their lowlevellock.h implementations. Please see section
(c) below and review my analysis.

> Carlos, any comments on this patch
> <http://sourceware.org/ml/libc-ports/2013-01/msg00084.html>?  It makes the
> ARM version of __lll_timedlock_wait closer to the HPPA version, but I
> don't know if many of the differences between different architecture
> versions of this code are really deliberate....

I do not believe HPPA needs a unique version of lowlevellock.c and
should instead be using the generic version. I do not recollect why I
would have created a copy of lowlevellock.c other than I was mostly
cloning SPARC when doing binutils and glibc work (Dave does a good
job). The HPPA kernel helper emulates compare-and-exchange and thus
everything that is required for a generic NPTL implementation.

> Would you agree that the generic Linux version
> (nptl/sysdeps/unix/sysv/linux/lowlevellock.c) doesn't need such a change
> because the loop is using atomic_exchange_acq rather than
> atomic_compare_and_exchange_bool_acq, so is already setting the futex to 2
> in every loop iteration?

(a) Review of the reporter's claim that a busy loop can result in the
ARM version of __lll_timedlock_wait.

I have reviewed the ARM function __lll_timedlock_wait and I can
confirm that it does indeed seem like the current code can cause a
busy-wait loop with three threads given the conditions provided by
Uwatoko-san. I see nothing unusual about the given conditions and with
enough cores you should be able to trigger the busy-wait loop.

(b) Review of the generic version of __lll_timedlock_wait.

I can confirm that the generic version of __lll_timedlock_wait is
*not* susceptible to a busy-wait loop. The use of `atomic_exchange_acq
(futex, 2)' in the while loop condition ensures that the code always
correctly *assumes* the lock is locked with more than one waiter. Note
that the expectation is that __lll_timedlock *must not* reset futex to
1, otherwise new threads may prevent us from waiting on a futex value
of 2. The expectation is that __lll_timedlock sets futex to 1 only if
it was zero.

(c) Solution to the busy wait loop.

I do not like the proposed solution. It is a band-aid on top of a
function that doesn't look like it has any architectural merits. I
think ARM should be using the generic NPTL version of lowlevellock.c.

The real problem is that ARM's __lll_timedlock does a
"atomic_exchange_acq (__futex, 1)" while attempting to acquire the
futex (violating what I noted in (b)). This means that any new thread
trying to acquire the futex will reset the futex to "locked, no
waiters", and this causes all other inflight threads doing a futex wait
to fail with -EWOULDBLOCK.

It seems to me that this is an ARM bug in the __lll_timedlock macro in
lowlevellock.h that was partly worked around in __lll_timedlock_wait in
lowlevellock.c.

For example compare ARM's implementation:
#define __lll_timedlock(futex, abstime, private)                              \
  ({                                                                          \
     int *__futex = (futex);                                                  \
     int __val = 0;                                                           \
                                                                              \
     if (__builtin_expect (atomic_exchange_acq (__futex, 1), 0))              \
       __val = __lll_timedlock_wait (__futex, abstime, private);              \
     __val;                                                                   \
  })

To SPARC's implementation:
static inline int
__attribute__ ((always_inline))
__lll_timedlock (int *futex, const struct timespec *abstime, int private)
{
  int val = atomic_compare_and_exchange_val_24_acq (futex, 1, 0);
  int result = 0;

  if (__builtin_expect (val != 0, 0))
    result = __lll_timedlock_wait (futex, abstime, private);
  return result;
}

Note that the SPARC implementation sets the futex to 1 only if it was
zero, otherwise it calls into __lll_timedlock_wait, which will then
set futex to 2, because we can't know at that point that there aren't
one or more waiters (the "perhaps" scenario if you've read Ulrich's
paper "Futexes Are Tricky").

I have reviewed all of the lowlevellock.h implementations and find the
following results:

No chance of busy-wait:
* mips, ia64, alpha, tile, i386, powerpc, x86-64, s390, sparc, hppa
- All of these targets use compare-and-exchange in __lll_timedlock
without resetting futex to 1.

Possible chance of busy-wait:
* m68k, aarch64, arm, sh/sh4
- All of these targets reset futex to 1 using atomic-exchange in
__lll_timedlock.

m68k and sh/sh4 use the generic version of lowlevellock.c, and this
means the busy-wait loop bug is not present but the use of
atomic-exchange causes a different bug. On m68k and sh/sh4 you get at
most N spurious failed futex wait syscalls for each thread calling
__lll_timedlock, where N is the number of inflight threads expecting
futex == 2.

For example on m68k:

* T1 calls __lll_timedlock setting futex to 1 and taking the lock.
* T2 calls __lll_timedlock setting futex to 1 but does not take the lock.
* T2 calls __lll_timedlock_wait and sets the futex to 2 and does not
gain the lock.
* T3 calls __lll_timedlock setting futex to 1 but does not take the lock.
* T2 calls futex wait but fails with -EWOULDBLOCK because T3 reset futex to 1.
-> One inflight thread (T2), and one spurious failed futex wait syscall.
* T2 again sets the futex to 2 and does not gain the lock.
* ... T2 and T3 go on to call futex wait syscall and both sleep.

In the ARM case with the current implementation of __lll_timedlock_wait
we get a busy-loop as outlined by Uwatoko-san.

The real fix is to use atomic-compare-and-swap in __lll_timedlock.

In fact the use of atomic-exchange in ARM's __lll_timedlock seems
dangerous given this unlock:
#define __lll_unlock(futex, private) \
  (void)                                                        \
    ({ int *__futex = (futex);                                  \
       int __oldval = atomic_exchange_rel (__futex, 0);         \
       if (__builtin_expect (__oldval > 1, 0))                  \
         lll_futex_wake (__futex, 1, private);                  \
    })

Thus you *could* have a new thread calling __lll_timedlock, setting
futex to 1, and every thread calling __lll_unlock skips the wake. This
could cause a *huge* delay between sleeping and waking as the wake is
deferred later and later. Eventually one of the threads sets the futex
value to 2 with an atomic operation, takes the lock, and later issues
the wake. However, the latency is increased and that's a performance
bug.

In summary:
- All machines listed above as `Possible chance of busy-wait' should
be using a generic __lll_timedlock that looks like this:
~~~
  int result = 0;
  if (atomic_compare_and_exchange_bool_acq (futex, 1, 0) != 0)
    result = __lll_timedlock_wait (futex, abstime, private);
  return result;
~~~
(simplest example taken from alpha's lowlevellock.h)

Comments?

Cheers,
Carlos.

Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

David Miller-13
From: "Carlos O'Donell" <[hidden email]>
Date: Fri, 8 Feb 2013 23:18:42 -0500

> I see that sparc32 also has a unique copy of lowlevellock.c Why the
> use of *_24_* atomic primitives? Faster?

On pre-v9 32-bit sparc, we lack any usable atomic compare and
swap.

All we have is an 8-bit spinlock.

So we implement things in a 32-bit word which is composed of a 24-bit
counter and an 8-bit lock.

Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

Carlos O'Donell-6
On 02/09/2013 01:49 AM, David Miller wrote:

> From: "Carlos O'Donell" <[hidden email]>
> Date: Fri, 8 Feb 2013 23:18:42 -0500
>
>> I see that sparc32 also has a unique copy of lowlevellock.c Why the
>> use of *_24_* atomic primitives? Faster?
>
> On pre-v9 32-bit sparc, we lack any usable atomic compare and
> swap.
>
> All we have is an 8-bit spinlock.
>
> So we implement things in a 32-bit word which is composed of a 24-bit
> counter and an 8-bit lock.
 
Thus a futex on sparc looks like this?

struct futex {
  union {
    int whole;
    struct {
      char lock;
      char value[3];
    } __split;
  } __futex;
};

With only 24-bits for the value?

I'll have to remember pre-v9 sparc only has 24-bits there.

I'd seen the *other* sparc pre-v9 implementation that used 64 global
locks per-library and that seemed signal unsafe and prone to deadlocks.

How do you deal with the FUTEX_WAITERS/FUTEX_OWNER_DIED bits that
are set in the high bits of the word?

Or a tid that is 26-bits (FUTEX_TID_MASK)?

Cheers,
Carlos.

Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

David Miller-13
From: "Carlos O'Donell" <[hidden email]>
Date: Sat, 09 Feb 2013 09:55:43 -0500

> I'd seen the *other* sparc pre-v9 implementation that used 64 global
> locks per-library and that seemed signal unsafe and prone to deadlocks.

All of these pre-v9 things are signal unsafe and deadlock.

I thought about doing the kernel atomic emulation other platforms have
adopted, but frankly these cpus are so old and deprecated that they're
not worth doing the work for.

And by the time we'd propagate all of this infrastructure necessary to
support this kind of scheme, those cpus would be even more outdated.

Even debian does v9-only build on 32-bit.

Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

Carlos O'Donell-2
On Sat, Feb 9, 2013 at 11:56 PM, David Miller <[hidden email]> wrote:

> From: "Carlos O'Donell" <[hidden email]>
> Date: Sat, 09 Feb 2013 09:55:43 -0500
>
>> I'd seen the *other* sparc pre-v9 implementation that used 64 global
>> locks per-library and that seemed signal unsafe and prone to deadlocks.
>
> All of these pre-v9 things are signal unsafe and deadlock.
>
> I thought about doing the kernel atomic emulation other platforms have
> adopted, but frankly these cpus are so old and deprecated that they're
> not worth doing the work for.
>
> And by the time we'd propagate all of this infrastructure necessary to
> support this kind of scheme, those cpus would be even more outdated.
>
> Even debian does v9-only build on 32-bit.

Eminently practical. Just curious. Thanks for verifying what I suspected.

Cheers,
Carlos.

Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

Daniel Jacobowitz-2
It's been too long, sorry.  It may have been necessary solely to
provide the separate EABI and NPTL versions in sysdeps; you'd have to
look at e.g. the sysdeps selection order for the LinuxThreads version.
It may also be related to the lack of usable atomic primitives,
pre-EABI.

On Sun, Feb 10, 2013 at 12:54 PM, Carlos O'Donell
<[hidden email]> wrote:

> On Sat, Feb 9, 2013 at 11:56 PM, David Miller <[hidden email]> wrote:
>> From: "Carlos O'Donell" <[hidden email]>
>> Date: Sat, 09 Feb 2013 09:55:43 -0500
>>
>>> I'd seen the *other* sparc pre-v9 implementation that used 64 global
>>> locks per-library and that seemed signal unsafe and prone to deadlocks.
>>
>> All of these pre-v9 things are signal unsafe and deadlock.
>>
>> I thought about doing the kernel atomic emulation other platforms have
>> adopted, but frankly these cpus are so old and deprecated that they're
>> not worth doing the work for.
>>
>> And by the time we'd propagate all of this infrastructure necessary to
>> support this kind of scheme, those cpus would be even more outdated.
>>
>> Even debian does v9-only build on 32-bit.
>
> Eminently practical. Just curious. Thanks for verifying what I suspected.
>
> Cheers,
> Carlos.



--
Thanks,
Daniel

Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

Daniel Jacobowitz-2
You could try asking Richard Earnshaw...

On Tue, Feb 12, 2013 at 4:41 PM, Daniel Jacobowitz <[hidden email]> wrote:

> It's been too long, sorry.  It may have been necessary solely to
> provide the separate EABI and NPTL versions in sysdeps; you'd have to
> look at e.g. the sysdeps selection order for the LinuxThreads version.
>  It may also be related to the lack of usable atomic primitives,
> pre-EABI.
>
> On Sun, Feb 10, 2013 at 12:54 PM, Carlos O'Donell
> <[hidden email]> wrote:
>> On Sat, Feb 9, 2013 at 11:56 PM, David Miller <[hidden email]> wrote:
>>> From: "Carlos O'Donell" <[hidden email]>
>>> Date: Sat, 09 Feb 2013 09:55:43 -0500
>>>
>>>> I'd seen the *other* sparc pre-v9 implementation that used 64 global
>>>> locks per-library and that seemed signal unsafe and prone to deadlocks.
>>>
>>> All of these pre-v9 things are signal unsafe and deadlock.
>>>
>>> I thought about doing the kernel atomic emulation other platforms have
>>> adopted, but frankly these cpus are so old and deprecated that they're
>>> not worth doing the work for.
>>>
>>> And by the time we'd propagate all of this infrastructure necessary to
>>> support this kind of scheme, those cpus would be even more outdated.
>>>
>>> Even debian does v9-only build on 32-bit.
>>
>> Eminently practical. Just curious. Thanks for verifying what I suspected.
>>
>> Cheers,
>> Carlos.
>
>
>
> --
> Thanks,
> Daniel



--
Thanks,
Daniel

Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.

Carlos O'Donell-2
In reply to this post by Daniel Jacobowitz-2
On Tue, Feb 12, 2013 at 4:41 PM, Daniel Jacobowitz <[hidden email]> wrote:
> It's been too long, sorry.  It may have been necessary solely to
> provide the separate EABI and NPTL versions in sysdeps; you'd have to
> look at e.g. the sysdeps selection order for the LinuxThreads version.
>  It may also be related to the lack of usable atomic primitives,
> pre-EABI.

Thanks! I knew it was a long shot, but sometimes people remember the
oddest things for the longest time :-)

Cheers,
Carlos.