[PATCH] Unify pthread_once (bug 15215)


Torvald Riegel-4
See http://sourceware.org/bugzilla/show_bug.cgi?id=15215 for background.

I1 and I2 follow essentially the same algorithm, and we can replace it
with a unified variant, as the bug suggests.  See the attached patch for
a modified version of the sparc instance.  The differences between both
are either cosmetic, or are unnecessary changes (ie, how the
init-finished state is set (atomic_inc vs. store), or how the fork
generations are compared).

Both I1 and I2 were missing a release memory order (MO) when marking
once_control as finished initialization.  If the particular arch doesn't
need a HW barrier for release, we at least need a compiler barrier; if
it's needed, the original I1 and I2 are not guaranteed to work.

Both I1 and I2 were missing acquire MO on the very first load of
once_control.  This needs to synchronize with the release MO on setting
the state to init-finished, so without it it's not guaranteed to work
either.
Note that this will make a call to pthread_once that doesn't need to
actually run the init routine slightly slower due to the additional
acquire barrier.  If you're really concerned about this overhead, speak
up.  There are ways to avoid it, but it comes with additional complexity
and bookkeeping.
I'm currently also using the existing atomic_{read/write}_barrier
functions instead of not-yet-existing load_acq or store_rel functions.
I'm not sure whether the latter can have somewhat more efficient
implementations on Power and ARM; if so, and if you're concerned about
the overhead, we can add load_acq and store_rel to atomic.h and start
using it.  This would be in line with C11, where we should eventually be
heading to anyways, IMO.
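
For illustration, the acquire/release pairing described above can be
sketched with C11 atomics.  This is only a minimal sketch under stated
assumptions, not the attached patch: sketch_once_t, sketch_once and the
three state values are made-up names, waiters spin with sched_yield()
instead of blocking on a futex, and fork generations are ignored
entirely.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical state values; the real code uses different magic numbers. */
enum { ONCE_NEW = 0, ONCE_RUNNING = 1, ONCE_DONE = 2 };

typedef struct { atomic_int state; } sketch_once_t;

void sketch_once(sketch_once_t *once, void (*init_routine)(void))
{
    for (;;) {
        /* Acquire load: pairs with the release store below, so a caller
           that sees ONCE_DONE also sees everything init_routine wrote. */
        int s = atomic_load_explicit(&once->state, memory_order_acquire);
        if (s == ONCE_DONE)
            return;

        if (s == ONCE_NEW) {
            int expected = ONCE_NEW;
            /* Try to claim the right to run the initializer. */
            if (atomic_compare_exchange_strong_explicit(
                    &once->state, &expected, ONCE_RUNNING,
                    memory_order_acquire, memory_order_relaxed)) {
                init_routine();
                /* Release store: publishes the initializer's writes. */
                atomic_store_explicit(&once->state, ONCE_DONE,
                                      memory_order_release);
                return;
            }
            continue; /* Lost the race; re-read the state. */
        }

        sched_yield(); /* Someone else is initializing; wait. */
    }
}

/* Tiny demonstration helpers (also hypothetical). */
static int sketch_init_calls;
static void sketch_init(void) { sketch_init_calls++; }

int sketch_demo(void)
{
    static sketch_once_t once; /* zero-initialized, i.e. ONCE_NEW */
    sketch_once(&once, sketch_init);
    sketch_once(&once, sketch_init);
    return sketch_init_calls;
}
```

The second call in sketch_demo() takes the early return after the
acquire load, which is the path whose cost is being discussed here.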

Both I1 and I2 have an ABA issue on __fork_generation, as explained in
the comments that the patch adds.  How do you all feel about this?
I can't present a simple fix right now, but I believe it could be fixed
with additional bookkeeping.
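
As a rough illustration of how such an ABA situation can arise, the
sketch below assumes that the two low bits of the once_control word
hold the state and the remaining 30 bits hold the fork generation, so
the generation advances by 4 per fork.  That layout, and all names
below, are assumptions made for the example; the comments added by the
patch are the authoritative description.

```c
#include <stdint.h>

#define STATE_BITS   2u
#define GEN_INCR     (1u << STATE_BITS) /* assumed increment per fork */
#define IN_PROGRESS  1u                 /* assumed "init running" bit */

/* Generation counter value after the given number of forks,
   with 32-bit wraparound. */
uint32_t generation_after_forks(uint32_t gen, uint64_t forks)
{
    return (uint32_t)(gen + forks * GEN_INCR);
}

/* Does an in-progress once word appear to belong to the
   current fork generation? */
int same_generation(uint32_t once_word, uint32_t current_gen)
{
    return (once_word & ~(GEN_INCR - 1u)) == current_gen;
}
```

Under these assumptions, a word left at (0 | IN_PROGRESS) by an
interrupted initialization is correctly seen as stale after one fork,
but after 2^30 forks the counter has wrapped and the stale word is
indistinguishable from a live in-progress initialization.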

If there's no objection to the essence of this patch, I'll post another
patch that actually replaces I1 and I2 with the modified variant in the
attached patch.

Cleaning up the magic numbers, perhaps fixing the ABA issue, and
comparing to the custom asm versions would be next.  I had a brief look
at the latter, and at least x86 doesn't seem to do anything logically
different.

Torvald

Re: [PATCH] Unify pthread_once (bug 15215)

Rich Felker
On Wed, May 08, 2013 at 04:43:57PM +0200, Torvald Riegel wrote:
> Note that this will make a call to pthread_once that doesn't need to
> actually run the init routine slightly slower due to the additional
> acquire barrier.  If you're really concerned about this overhead, speak
> up.  There are ways to avoid it, but it comes with additional complexity
> and bookkeeping.

On the one hand, I think it should be avoided if at all possible.
pthread_once is the correct, canonical way to do initialization (as
opposed to hacks like library init functions or global ctors), and the
main doubt lots of people have about doing it the correct way is that
they're going to kill performance if they call pthread_once from every
point where initialization needs to have been completed. If every call
imposes memory synchronization, performance might become a real issue
discouraging people from following best practices for library
initialization.
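
The usage pattern in question looks roughly like the following sketch,
where every entry point calls pthread_once before touching shared
state.  The table and function names are made up for the example.

```c
#include <pthread.h>

static pthread_once_t table_once = PTHREAD_ONCE_INIT;
static unsigned squares[16];

/* Runs exactly once, no matter how many threads call square_lookup. */
static void build_table(void)
{
    for (unsigned i = 0; i < 16; i++)
        squares[i] = i * i;
}

unsigned square_lookup(unsigned i)
{
    /* Called on every entry; after the first call this should
       ideally cost little more than an empty function call. */
    pthread_once(&table_once, build_table);
    return squares[i % 16];
}
```

The cost of the pthread_once call on the second and later invocations
of square_lookup is exactly the overhead under discussion.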

On the other hand, I don't think it's conforming to elide the barrier.
POSIX states (XSH 4.11 Memory Synchronization):

"The pthread_once() function shall synchronize memory for the first
call in each thread for a given pthread_once_t object."

Since it's impossible to track whether a call is the first call in a
given thread, this means every call to pthread_once() is required to
be a full memory barrier. I suspect this is unintended, and we should
perhaps file a bug report with the Austin Group and see if the
requirement can be relaxed.

Rich

Re: [PATCH] Unify pthread_once (bug 15215)

Torvald Riegel-4
On Wed, 2013-05-08 at 13:51 -0400, Rich Felker wrote:

> On Wed, May 08, 2013 at 04:43:57PM +0200, Torvald Riegel wrote:
> > Note that this will make a call to pthread_once that doesn't need to
> > actually run the init routine slightly slower due to the additional
> > acquire barrier.  If you're really concerned about this overhead, speak
> > up.  There are ways to avoid it, but it comes with additional complexity
> > and bookkeeping.
>
> On the one hand, I think it should be avoided if at all possible.
> pthread_once is the correct, canonical way to do initialization (as
> opposed to hacks like library init functions or global ctors), and the
> main doubt lots of people have about doing it the correct way is that
> they're going to kill performance if they call pthread_once from every
> point where initialization needs to have been completed. If every call
> imposes memory synchronization, performance might become a real issue
> discouraging people from following best practices for library
> initialization.

Well, what we precisely need is that the initialization happens-before
(ie, the relation from the, say, C11 memory model) every call that does
not in fact initialize.  If initialization happened on another thread,
you need to synchronize.  But from there on, you are essentially free to
establish this in any way you want.  And there are ways, because
happens-before is more-or-less transitive.

> On the other hand, I don't think it's conforming to elide the barrier.
> POSIX states (XSH 4.11 Memory Synchronization):
>
> "The pthread_once() function shall synchronize memory for the first
> call in each thread for a given pthread_once_t object."

No, it's not.  You could see just parts of the effects of the
initialization; potentially reading garbage can't be the intended
semantics :)

> Since it's impossible to track whether a call is the first call in a
> given thread

Are you sure about this? :)

> this means every call to pthread_once() is required to
> be a full memory barrier.

Note that we do not need a full memory barrier, just an acquire memory
barrier.  So this only matters on architectures with memory models that
give weaker per-default ordering guarantees.  For example, this doesn't
add any hardware barrier instructions on x86 or Sparc TSO.  But for
Power and ARM it does.

> I suspect this is unintended, and we should
> perhaps file a bug report with the Austin Group and see if the
> requirement can be relaxed.

I don't think that other semantics are intended.  If you return from
pthread_once(), initialization should have happened before that.  If it
doesn't, you don't really know whether initialization happened once, so
programs would be forced to do their own synchronization.


Torvald


Re: [PATCH] Unify pthread_once (bug 15215)

Rich Felker
On Wed, May 08, 2013 at 10:47:26PM +0200, Torvald Riegel wrote:

> On Wed, 2013-05-08 at 13:51 -0400, Rich Felker wrote:
> > On Wed, May 08, 2013 at 04:43:57PM +0200, Torvald Riegel wrote:
> > > Note that this will make a call to pthread_once that doesn't need to
> > > actually run the init routine slightly slower due to the additional
> > > acquire barrier.  If you're really concerned about this overhead, speak
> > > up.  There are ways to avoid it, but it comes with additional complexity
> > > and bookkeeping.
> >
> > On the one hand, I think it should be avoided if at all possible.
> > pthread_once is the correct, canonical way to do initialization (as
> > opposed to hacks like library init functions or global ctors), and the
> > main doubt lots of people have about doing it the correct way is that
> > they're going to kill performance if they call pthread_once from every
> > point where initialization needs to have been completed. If every call
> > imposes memory synchronization, performance might become a real issue
> > discouraging people from following best practices for library
> > initialization.
>
> Well, what we precisely need is that the initialization happens-before
> (ie, the relation from the, say, C11 memory model) every call that does
> not in fact initialize.  If initialization happened on another thread,
> you need to synchronize.  But from there on, you are essentially free to
> establish this in any way you want.  And there are ways, because
> happens-before is more-or-less transitive.
>
> > On the other hand, I don't think it's conforming to elide the barrier.
> > POSIX states (XSH 4.11 Memory Synchronization):
> >
> > "The pthread_once() function shall synchronize memory for the first
> > call in each thread for a given pthread_once_t object."
>
> No, it's not.  You could see just parts of the effects of the
> initialization; potentially reading garbage can't be the intended
> semantics :)

The work of synchronizing memory should take place at the end of the
pthread_once call that actually does the initialization, rather than
in the other threads which synchronize. This is the way the x86 memory
model naturally works, but perhaps it's prohibitive to achieve on
other architectures. However, the idea is that pthread_once only runs
init routines a small finite number of times, so even if you had to do
some horrible hack that makes the synchronization on return 1000x
slower (e.g. a syscall), it would still be better than incurring the
cost of a full acquire barrier in each subsequent call, which ideally
should have the same cost as a call to an empty function.

> > Since it's impossible to track whether a call is the first call in a
> > given thread
>
> Are you sure about this? :)

It's impossible with bounded memory requirements, and thus impossible
in general (allocating memory for the tracking might fail).

> > this means every call to pthread_once() is required to
> > be a full memory barrier.
>
> Note that we do not need a full memory barrier, just an acquire memory
> barrier.  So this only matters on architectures with memory models that
> give weaker per-default ordering guarantees.  For example, this doesn't
> add any hardware barrier instructions on x86 or Sparc TSO.  But for
> Power and ARM it does.

Yes, I see that.

> > I suspect this is unintended, and we should
> > perhaps file a bug report with the Austin Group and see if the
> > requirement can be relaxed.
>
> I don't think that other semantics are intended.  If you return from
> pthread_once(), initialization should have happened before that.  If it
> doesn't, you don't really know whether initialization happened once, so
> programs would be forced to do their own synchronization.

I think my confusion is merely that POSIX does not define the phrase
"synchronize memory", and in the absence of a definition, "full memory
barrier" (both release and acquire semantics) is the only reasonable
interpretation I can find. In other words, it seems like a
pathological conforming program could attempt to use the language in
the specification to use pthread_once as a release barrier. I'm not
sure if there are ways this could be meaningfully arranged (i.e. with
well-defined ordering); off-hand, I would think tricks with cancelling
an in-progress invocation of pthread_once might make it possible.

By the way, cancellation probably makes the above POSIX text incorrect
anyway; a thread could call pthread_once on the same pthread_once_t
object more than once, with the second call not being a no-op, if the
initialization routine for the first call is cancelled and the second
call takes place from a cancellation cleanup handler.

Rich

Re: [PATCH] Unify pthread_once (bug 15215)

Torvald Riegel-4
On Wed, 2013-05-08 at 17:25 -0400, Rich Felker wrote:

> On Wed, May 08, 2013 at 10:47:26PM +0200, Torvald Riegel wrote:
> > On Wed, 2013-05-08 at 13:51 -0400, Rich Felker wrote:
> > > On Wed, May 08, 2013 at 04:43:57PM +0200, Torvald Riegel wrote:
> > > > Note that this will make a call to pthread_once that doesn't need to
> > > > actually run the init routine slightly slower due to the additional
> > > > acquire barrier.  If you're really concerned about this overhead, speak
> > > > up.  There are ways to avoid it, but it comes with additional complexity
> > > > and bookkeeping.
> > >
> > > On the one hand, I think it should be avoided if at all possible.
> > > pthread_once is the correct, canonical way to do initialization (as
> > > opposed to hacks like library init functions or global ctors), and the
> > > main doubt lots of people have about doing it the correct way is that
> > > they're going to kill performance if they call pthread_once from every
> > > point where initialization needs to have been completed. If every call
> > > imposes memory synchronization, performance might become a real issue
> > > discouraging people from following best practices for library
> > > initialization.
> >
> > Well, what we precisely need is that the initialization happens-before
> > (ie, the relation from the, say, C11 memory model) every call that does
> > not in fact initialize.  If initialization happened on another thread,
> > you need to synchronize.  But from there on, you are essentially free to
> > establish this in any way you want.  And there are ways, because
> > happens-before is more-or-less transitive.
> >
> > > On the other hand, I don't think it's conforming to elide the barrier.
> > > POSIX states (XSH 4.11 Memory Synchronization):
> > >
> > > "The pthread_once() function shall synchronize memory for the first
> > > call in each thread for a given pthread_once_t object."
> >
> > No, it's not.  You could see just parts of the effects of the
> > initialization; potentially reading garbage can't be the intended
> > semantics :)
>
> The work of synchronizing memory should take place at the end of the
> pthread_once call that actually does the initialization, rather than
> in the other threads which synchronize.

This isn't how the (hardware) memory models work.  And it makes sense;
if one CPU could prevent reordering in other CPUs (which would be
required for what you have in mind), this would be an unconditional big
hammer.  Instead, CPUs can opt in by issuing barriers when needed, which
then prevent reordering wrt. what happens globally to memory.

> This is the way the x86 memory
> model naturally works, but perhaps it's prohibitive to achieve on
> other architectures.

The x86 memory model is just stronger than others, so certain
reorderings don't appear or aren't visible to programs.  IOW, you don't
need to do certain things explicitly for the hardware.  You still do
need the appropriate compiler barriers though; for example, if the
compiler reorders the once_control release store to before the
initialization stores, you still have an incorrectly synchronized
program, even on x86.
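
A minimal sketch of that point, using C11 atomics and made-up names
(payload, ready, publish, try_consume): on x86 the release store and
acquire load below typically compile to plain mov instructions, but
they still forbid the compiler from moving the payload store after the
flag store, or the payload load before the flag load.

```c
#include <stdatomic.h>

static int payload;      /* ordinary data written by the initializer */
static atomic_int ready; /* plays the role of once_control */

void publish(int value)
{
    payload = value;
    /* Release: the compiler (and CPU) may not sink the payload
       store below this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int try_consume(int *out)
{
    /* Acquire: the payload load may not be hoisted above this load. */
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return 0;
    *out = payload;
    return 1;
}
```

With a plain int flag instead of the atomic accesses, nothing would
stop the compiler from reordering the two stores in publish().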

More background on this can be found in the C11 and C++11 memory models,
in the Batty et al. paper formalizing C++11's.  This list of mappings
from these language-level models to HW could also be interesting (note
that it doesn't cover the compiler side of this explicitly):
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

> However, the idea is that pthread_once only runs
> init routines a small finite number of times, so even if you had to do
> some horrible hack that makes the synchronization on return 1000x
> slower (e.g. a syscall), it would still be better than incurring the
> cost of a full acquire barrier in each subsequent call, which ideally
> should have the same cost as a call to an empty function.

That would be true if non-first calls appear
1000*(syscall_overhead/acquire_mbar_overhead) times.  But do they?

I think the way forward here is to:
1) Fix the implementation (ie, add the mbars).
2) Let the arch maintainers of the affected archs with weak memory models
(or people interested in this) look at this and come up with some
measurements for how much overhead the mbars actually present in real
code.
3) Decide whether this overhead justifies adding optimizations.

This patch is step 1.  I don't think we need to merge this with step 3.

> > > Since it's impossible to track whether a call is the first call in a
> > > given thread
> >
> > Are you sure about this? :)
>
> It's impossible with bounded memory requirements, and thus impossible
> in general (allocating memory for the tracking might fail).

I believe you think about needing to track more than you actually need
to know.  All you need is knowing whether a thread established a
happens-before with whoever initialized the once_control in the past.
So you do need per-thread state, and per-once_control state, but not
necessarily more.  If in doubt, you can still do the acquire barrier.

> > > this means every call to pthread_once() is required to
> > > be a full memory barrier.
> >
> > Note that we do not need a full memory barrier, just an acquire memory
> > barrier.  So this only matters on architectures with memory models that
> > give weaker per-default ordering guarantees.  For example, this doesn't
> > add any hardware barrier instructions on x86 or Sparc TSO.  But for
> > Power and ARM it does.
>
> Yes, I see that.
>
> > > I suspect this is unintended, and we should
> > > perhaps file a bug report with the Austin Group and see if the
> > > requirement can be relaxed.
> >
> > I don't think that other semantics are intended.  If you return from
> > pthread_once(), initialization should have happened before that.  If it
> > doesn't, you don't really know whether initialization happened once, so
> > programs would be forced to do their own synchronization.
>
> I think my confusion is merely that POSIX does not define the phrase
> "synchronize memory", and in the absence of a definition, "full memory
> barrier" (both release and acquire semantics) is the only reasonable
> interpretation I can find. In other words, it seems like a
> pathological conforming program could attempt to use the language in
> the specification to use pthread_once as a release barrier. I'm not
> sure if there are ways this could be meaningfully arranged (i.e. with
> well-defined ordering); off-hand, I would think tricks with cancelling
> an in-progress invocation of pthread_once might make it possible.

I agree that the absence of a proper memory model makes reasoning about
some of this hard.  I guess it would be best if POSIX would just endorse
C11's memory model, and specify the intended semantics in relation to
this model where needed.

For example, the C11 variant of pthread_once has the following
requirement:
"Completion of an effective call to the call_once function synchronizes
with all subsequent calls to the call_once function with the same value
of flag."

This makes intuitive sense, and is what's enforced by the patch I sent.
("synchronizes with" is a well-defined relationship in the model, and
contributes to happens-before.)
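
For reference, a minimal usage sketch of the C11 interface from
<threads.h>; the config_* names are made up for the example.

```c
#include <threads.h>

static once_flag config_flag = ONCE_FLAG_INIT;
static int config_value;
static int config_init_calls;

/* Effective call: runs once and publishes config_value. */
static void load_config(void)
{
    config_value = 42;
    config_init_calls++;
}

int get_config(void)
{
    /* Every later call synchronizes with the completion of the
       effective call, so config_value is safe to read here. */
    call_once(&config_flag, load_config);
    return config_value;
}

int config_init_count(void)
{
    return config_init_calls;
}
```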

Torvald


Re: [PATCH] Unify pthread_once (bug 15215)

Rich Felker
On Thu, May 09, 2013 at 10:39:25AM +0200, Torvald Riegel wrote:
> > However, the idea is that pthread_once only runs
> > init routines a small finite number of times, so even if you had to do
> > some horrible hack that makes the synchronization on return 1000x
> > slower (e.g. a syscall), it would still be better than incurring the
> > cost of a full acquire barrier in each subsequent call, which ideally
> > should have the same cost as a call to an empty function.
>
> That would be true if non-first calls appear
> 1000*(syscall_overhead/acquire_mbar_overhead) times.  But do they?

In theory they might. Imagine a math function that might be called
millions or billions of times, but which depends on a precomputed
table. Personally, my view of best-practices is that you should use
'static const' for such tables, even if they're huge, rather than
runtime generation, but unfortunately I think my view is still a
minority one...

Also, keep in mind that even large overhead on the first call to
pthread_once is likely to be small in comparison to the time spent in
the initialization function, while even small overhead is huge in
comparison to a call to pthread_once that doesn't call the
initialization function.

> I think the way forward here is to:
> 1) Fix the implementation (ie, add the mbars).
> 2) Let the arch maintainers of the affected archs with weak memory models
> (or people interested in this) look at this and come up with some
> measurements for how much overhead the mbars actually present in real
> code.
> 3) Decide whether this overhead justifies adding optimizations.
>
> This patch is step 1.  I don't think we need to merge this with step 3.

I think this is a reasonable approach.

> > > > Since it's impossible to track whether a call is the first call in a
> > > > given thread
> > >
> > > Are you sure about this? :)
> >
> > It's impossible with bounded memory requirements, and thus impossible
> > in general (allocating memory for the tracking might fail).
>
> I believe you think about needing to track more than you actually need
> to know.  All you need is knowing whether a thread established a
> happens-before with whoever initialized the once_control in the past.
> So you do need per-thread state, and per-once_control state, but not
> necessarily more.  If in doubt, you can still do the acquire barrier.

The number of threads and the number of once controls are both
unbounded. You might be able to solve the problem with serial numbers if
there were room to store a sufficiently large one in the once control,
but the once control is 32-bit and the serial numbers could (in a
pathological but valid application) easily overflow 32 bits.

> > I think my confusion is merely that POSIX does not define the phrase
> > "synchronize memory", and in the absence of a definition, "full memory
> > barrier" (both release and acquire semantics) is the only reasonable
> > interpretation I can find. In other words, it seems like a
> > pathological conforming program could attempt to use the language in
> > the specification to use pthread_once as a release barrier. I'm not
> > sure if there are ways this could be meaningfully arranged (i.e. with
> > well-defined ordering); off-hand, I would think tricks with cancelling
> > an in-progress invocation of pthread_once might make it possible.
>
> I agree that the absence of a proper memory model makes reasoning about
> some of this hard.  I guess it would be best if POSIX would just endorse
> C11's memory model, and specify the intended semantics in relation to
> this model where needed.

Agreed, and I suspect this is what they'll do. I can raise the issue,
but perhaps you'd be better at expressing it. Let me know if you'd
rather I do it.

Rich

Re: [PATCH] Unify pthread_once (bug 15215)

Torvald Riegel-4
On Thu, 2013-05-09 at 10:02 -0400, Rich Felker wrote:

> On Thu, May 09, 2013 at 10:39:25AM +0200, Torvald Riegel wrote:
> > > However, the idea is that pthread_once only runs
> > > init routines a small finite number of times, so even if you had to do
> > > some horrible hack that makes the synchronization on return 1000x
> > > slower (e.g. a syscall), it would still be better than incurring the
> > > cost of a full acquire barrier in each subsequent call, which ideally
> > > should have the same cost as a call to an empty function.
> >
> > That would be true if non-first calls appear
> > 1000*(syscall_overhead/acquire_mbar_overhead) times.  But do they?
>
> In theory they might. Imagine a math function that might be called
> millions or billions of times, but which depends on a precomputed
> table. Personally, my view of best-practices is that you should use
> 'static const' for such tables, even if they're huge, rather than
> runtime generation, but unfortunately I think my view is still a
> minority one...
>
> Also, keep in mind that even large overhead on the first call to
> pthread_once is likely to be small in comparison to the time spent in
> the initialization function, while even small overhead is huge in
> comparison to a call to pthread_once that doesn't call the
> initialization function.
>
> > I think the way forward here is to:
> > 1) Fix the implementation (ie, add the mbars).
> > 2) Let the arch maintainers of the affected archs with weak memory models
> > (or people interested in this) look at this and come up with some
> > measurements for how much overhead the mbars actually present in real
> > code.
> > 3) Decide whether this overhead justifies adding optimizations.
> >
> > This patch is step 1.  I don't think we need to merge this with step 3.
>
> I think this is a reasonable approach.
>
> > > > > Since it's impossible to track whether a call is the first call in a
> > > > > given thread
> > > >
> > > > Are you sure about this? :)
> > >
> > > It's impossible with bounded memory requirements, and thus impossible
> > > in general (allocating memory for the tracking might fail).
> >
> > I believe you think about needing to track more than you actually need
> > to know.  All you need is knowing whether a thread established a
> > happens-before with whoever initialized the once_control in the past.
> > So you do need per-thread state, and per-once_control state, but not
> > necessarily more.  If in doubt, you can still do the acquire barrier.
>
> The number of threads and the number of once controls are both
> unbounded.

They are bounded by the available memory :)  So if you can do with a
fixed amount of data in both thread state and once_control state, you
should be fine.

> You might be able to solve the problem with serial numbers if
> there were room to store a sufficiently large one in the once control,
> but the once control is 32-bit and the serial numbers could (in a
> pathological but valid application) easily overflow 32 bits.

The overflow can be an issue, but in that case I guess you can still try
to detect an overflow globally using global state, and just do the
acquire barrier in this case.
Informally, one can try to trade off a comparison of state in
once_control with a TLS variable; if that is significantly faster than
an acquire barrier, it can be useful; if it's about the same, it doesn't
make sense.
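
One possible shape of such a scheme, purely as a hypothetical sketch
and not something the patch implements: each completed initialization
gets a global serial number, and each thread remembers in TLS the
highest serial it has synchronized with.  The sketch sidesteps the
overflow question by using a 64-bit word, which the real 32-bit
once_control does not have room for; all names are made up.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* word: 0 = new, 1 = running, >= 2 = done (value is the serial). */
typedef struct { _Atomic uint64_t word; } serial_once_t;

static _Atomic uint64_t next_serial = 2;
static _Thread_local uint64_t seen_serial; /* highest serial acquired */

void serial_once(serial_once_t *once, void (*init_routine)(void))
{
    uint64_t w = atomic_load_explicit(&once->word, memory_order_relaxed);
    /* Fast path: this thread already synchronized with a completion
       that is ordered after this one, so no barrier is needed. */
    if (w >= 2 && w <= seen_serial)
        return;

    for (;;) {
        w = atomic_load_explicit(&once->word, memory_order_acquire);
        if (w >= 2) {
            if (w > seen_serial)
                seen_serial = w;
            return;
        }
        if (w == 0) {
            uint64_t expected = 0;
            if (atomic_compare_exchange_strong_explicit(
                    &once->word, &expected, 1,
                    memory_order_acquire, memory_order_relaxed)) {
                init_routine();
                /* acq_rel RMW chains all completions together, so a
                   higher serial implies visibility of lower ones. */
                uint64_t s = atomic_fetch_add_explicit(
                    &next_serial, 1, memory_order_acq_rel);
                atomic_store_explicit(&once->word, s,
                                      memory_order_release);
                if (s > seen_serial)
                    seen_serial = s;
                return;
            }
            continue;
        }
        sched_yield(); /* another thread is initializing */
    }
}

/* Demonstration helpers (hypothetical). */
static int serial_init_calls;
static void serial_init(void) { serial_init_calls++; }

int serial_demo(void)
{
    static serial_once_t a, b; /* zero-initialized: new */
    serial_once(&a, serial_init);
    serial_once(&a, serial_init);
    serial_once(&b, serial_init);
    serial_once(&b, serial_init);
    serial_once(&a, serial_init);
    return serial_init_calls;
}
```

Whether the relaxed load plus TLS comparison is actually cheaper than
an acquire load is precisely what would need to be measured.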

> > > I think my confusion is merely that POSIX does not define the phrase
> > > "synchronize memory", and in the absence of a definition, "full memory
> > > barrier" (both release and acquire semantics) is the only reasonable
> > > interpretation I can find. In other words, it seems like a
> > > pathological conforming program could attempt to use the language in
> > > the specification to use pthread_once as a release barrier. I'm not
> > > sure if there are ways this could be meaningfully arranged (i.e. with
> > > well-defined ordering); off-hand, I would think tricks with cancelling
> > > an in-progress invocation of pthread_once might make it possible.
> >
> > I agree that the absence of a proper memory model makes reasoning about
> > some of this hard.  I guess it would be best if POSIX would just endorse
> > C11's memory model, and specify the intended semantics in relation to
> > this model where needed.
>
> Agreed, and I suspect this is what they'll do. I can raise the issue,
> but perhaps you'd be better at expressing it. Let me know if you'd
> rather I do it.

I have no idea how the POSIX folks would feel about this.  After all, it
would create quite a dependency for POSIX.  With that in mind, trying to
resolve this isn't very high on my todo list.  If people would think
that this would be beneficial for how we can deal with POSIX
requirements, or for our users to understand the POSIX requirements
better, I can definitely try to follow up on this.  If you want to go
ahead and start discussing with them, please do so (please CC me on the
tracker bug).


Torvald



Re: [PATCH] Unify pthread_once (bug 15215)

Rich Felker
On Thu, May 09, 2013 at 05:14:28PM +0200, Torvald Riegel wrote:

> > > I agree that the absence of a proper memory model makes reasoning about
> > > some of this hard.  I guess it would be best if POSIX would just endorse
> > > C11's memory model, and specify the intended semantics in relation to
> > > this model where needed.
> >
> > Agreed, and I suspect this is what they'll do. I can raise the issue,
> > but perhaps you'd be better at expressing it. Let me know if you'd
> > rather I do it.
>
> I have no idea how the POSIX folks would feel about this.  After all, it
> would create quite a dependency for POSIX.  With that in mind, trying to
> resolve this isn't very high on my todo list.  If people would think
> that this would be beneficial for how we can deal with POSIX
> requirements, or for our users to understand the POSIX requirements
> better, I can definitely try to follow up on this.  If you want to go
> ahead and start discussing with them, please do so (please CC me on the
> tracker bug).

POSIX is aligned with ISO C, and since the current version of ISO C is
now the 2011 version, Issue 8 should be aligned to the 2011 version of
the C standard. I don't think the issue is whether it happens, but
making sure that the relevant text gets updated so that there's no
ambiguity as to whether it's compatible with the new C standard and
not placing unwanted additional implementation constraints like it may
be doing now.

Rich

Re: [PATCH] Unify pthread_once (bug 15215)

Torvald Riegel-4
On Thu, 2013-05-09 at 11:56 -0400, Rich Felker wrote:

> On Thu, May 09, 2013 at 05:14:28PM +0200, Torvald Riegel wrote:
> > > > I agree that the absence of a proper memory model makes reasoning about
> > > > some of this hard.  I guess it would be best if POSIX would just endorse
> > > > C11's memory model, and specify the intended semantics in relation to
> > > > this model where needed.
> > >
> > > Agreed, and I suspect this is what they'll do. I can raise the issue,
> > > but perhaps you'd be better at expressing it. Let me know if you'd
> > > rather I do it.
> >
> > I have no idea how the POSIX folks would feel about this.  After all, it
> > would create quite a dependency for POSIX.  With that in mind, trying to
> > resolve this isn't very high on my todo list.  If people would think
> > that this would be beneficial for how we can deal with POSIX
> > requirements, or for our users to understand the POSIX requirements
> > better, I can definitely try to follow up on this.  If you want to go
> > ahead and start discussing with them, please do so (please CC me on the
> > tracker bug).
>
> POSIX is aligned with ISO C, and since the current version of ISO C is
> now the 2011 version, Issue 8 should be aligned to the 2011 version of
> the C standard. I don't think the issue is whether it happens, but
> making sure that the relevant text gets updated so that there's no
> ambiguity as to whether it's compatible with the new C standard and
> not placing unwanted additional implementation constraints like it may
> be doing now.

So, if it is aligned, would POSIX be willing to base their definitions
on the C11 memory model?  Or would they want to keep their sometimes
rather vague requirements and just make sure that there are no obvious
inconsistencies or gaps?


Torvald


Re: [PATCH] Unify pthread_once (bug 15215)

Rich Felker
On Fri, May 10, 2013 at 10:30:57AM +0200, Torvald Riegel wrote:

> On Thu, 2013-05-09 at 11:56 -0400, Rich Felker wrote:
> > On Thu, May 09, 2013 at 05:14:28PM +0200, Torvald Riegel wrote:
> > > > > I agree that the absence of a proper memory model makes reasoning about
> > > > > some of this hard.  I guess it would be best if POSIX would just endorse
> > > > > C11's memory model, and specify the intended semantics in relation to
> > > > > this model where needed.
> > > >
> > > > Agreed, and I suspect this is what they'll do. I can raise the issue,
> > > > but perhaps you'd be better at expressing it. Let me know if you'd
> > > > rather I do it.
> > >
> > > I have no idea how the POSIX folks would feel about this.  After all, it
> > > would create quite a dependency for POSIX.  With that in mind, trying to
> > > resolve this isn't very high on my todo list.  If people would think
> > > that this would be beneficial for how we can deal with POSIX
> > > requirements, or for our users to understand the POSIX requirements
> > > better, I can definitely try to follow up on this.  If you want to go
> > > ahead and start discussing with them, please do so (please CC me on the
> > > tracker bug).
> >
> > POSIX is aligned with ISO C, and since the current version of ISO C is
> > now the 2011 version, Issue 8 should be aligned to the 2011 version of
> > the C standard. I don't think the issue is whether it happens, but
> > making sure that the relevant text gets updated so that there's no
> > ambiguity as to whether it's compatible with the new C standard and
> > not placing unwanted additional implementation constraints like it may
> > be doing now.
>
> So, if it is aligned, would POSIX be willing to base their definitions
> on the C11 memory model?  Or would they want to keep their sometimes
> rather vague requirements and just make sure that there are no obvious
> inconsistencies or gaps?

My guess is that they would adopt the C11 model.

Rich

Re: [PATCH] Unify pthread_once (bug 15215)

Carlos O'Donell-6
In reply to this post by Torvald Riegel-4
On 05/08/2013 10:43 AM, Torvald Riegel wrote:
> See http://sourceware.org/bugzilla/show_bug.cgi?id=15215 for background.

You've already hashed out the details of these changes with Rich,
and he has no objection to this first phase of the patch, which
is to unify the implementations.
 

> I1 and I2 follow essentially the same algorithm, and we can replace it
> with a unified variant, as the bug suggests.  See the attached patch for
> a modified version of the sparc instance.  The differences between both
> are either cosmetic, or are unnecessary changes (ie, how the
> init-finished state is set (atomic_inc vs. store), or how the fork
> generations are compared).
>
> Both I1 and I2 were missing a release memory order (MO) when marking
> once_control as finished initialization.  If the particular arch doesn't
> need a HW barrier for release, we at least need a compiler barrier; if
> it's needed, the original I1 and I2 are not guaranteed to work.
>
> Both I1 and I2 were missing acquire MO on the very first load of
> once_control.  This needs to synchronize with the release MO on setting
> the state to init-finished, so without it it's not guaranteed to work
> either.
> Note that this will make a call to pthread_once that doesn't need to
> actually run the init routine slightly slower due to the additional
> acquire barrier.  If you're really concerned about this overhead, speak
> up.  There are ways to avoid it, but it comes with additional complexity
> and bookkeeping.

We want correctness. This is a place where correctness is infinitely
more important than speed. We should be correct first and then we
should argue about how to make it fast.

> I'm currently also using the existing atomic_{read/write}_barrier
> functions instead of not-yet-existing load_acq or store_rel functions.
> I'm not sure whether the latter can have somewhat more efficient
> implementations on Power and ARM; if so, and if you're concerned about
> the overhead, we can add load_acq and store_rel to atomic.h and start
> using it.  This would be in line with C11, where we should eventually be
> heading to anyways, IMO.

Agreed.
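
For readers who want to see the acquire/release pairing discussed above written with C11 atomics, here is a minimal sketch. It is not the patch: once_sketch, the ONCE_* values and the demo helpers are hypothetical names, the futex wait/wake is replaced by sched_yield, and fork generations and cancellation are left out entirely. The acquire load and release store play the role that load_acq and store_rel would play.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical state values, used by this sketch only.  */
enum { ONCE_NEW = 0, ONCE_RUNNING = 1, ONCE_DONE = 2 };

/* Sketch of a once primitive with C11 atomics.  Assumptions: no fork
   handling, no cancellation, waiters yield instead of futex-waiting.  */
static int
once_sketch (atomic_int *ctl, void (*init_routine) (void))
{
  /* Acquire load: if we observe ONCE_DONE, we also observe every write
     made by init_routine before the release store at the end.  */
  int val = atomic_load_explicit (ctl, memory_order_acquire);
  for (;;)
    {
      if (val == ONCE_DONE)
        return 0;
      if (val == ONCE_NEW)
        {
          /* Try to claim the initialization.  On failure VAL is reloaded
             with acquire order, for the same reason as the first load.  */
          if (atomic_compare_exchange_strong_explicit
                (ctl, &val, ONCE_RUNNING,
                 memory_order_acquire, memory_order_acquire))
            break;
          continue;
        }
      /* Another thread is initializing: yield, then look again.  */
      sched_yield ();
      val = atomic_load_explicit (ctl, memory_order_acquire);
    }

  init_routine ();

  /* Release store: publishes the writes done by init_routine.  */
  atomic_store_explicit (ctl, ONCE_DONE, memory_order_release);
  return 0;
}

/* Usage example: the initializer runs once however often we call.  */
static int demo_calls;
static void demo_init (void) { demo_calls++; }

static int
demo (void)
{
  static atomic_int ctl;   /* zero-initialized, i.e. ONCE_NEW */
  for (int i = 0; i < 3; i++)
    once_sketch (&ctl, demo_init);
  return demo_calls;
}
```

Whether a plain acquire load is cheaper than a load followed by a full read barrier depends on the architecture, which is the open question raised above for Power and ARM.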

> Both I1 and I2 have an ABA issue on __fork_generation, as explained in
> the comments that the patch adds.  How do you all feel about this?
> I can't present a simple fix right now, but I believe it could be fixed
> with additional bookkeeping.
>
> If there's no objection to the essence of this patch, I'll post another
> patch that actually replaces I1 and I2 with the modified variant in the
> attached patch.

Please repost.

> Cleaning up the magic numbers, perhaps fixing the ABA issue, and
> comparing to the custom asm versions would be next.  I had a brief look
> at the latter, and at least x86 doesn't seem to do anything logically
> different.

Right, that can be another step.
 

> diff --git a/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c
> index 5879f44..f9b0953 100644
> --- a/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c
> +++ b/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c
> @@ -28,11 +28,31 @@ clear_once_control (void *arg)
>  {
>    pthread_once_t *once_control = (pthread_once_t *) arg;
>  
> +  /* Reset to the uninitialized state here (see __pthread_once).  Also, we
> +     don't need a stronger memory order because we do not need to make any
> +     other of our writes visible to other threads that see this value.  */
>    *once_control = 0;
>    lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
>  }
>  
>  
> +/* This is similar to a lock implementation, but we distinguish between three
> +   states: not yet initialized (0), initialization finished (2), and
> +   initialization in progress (__fork_generation | 1).  If in the first state,
> +   threads will try to run the initialization by moving to the second state;
> +   the first thread to do so via a CAS on once_control runs init_routine,
> +   other threads block.
> +   When forking the process, some threads can be interrupted during the second
> +   state; they won't be present in the forked child, so we need to restart
> +   initialization in the child.  To distinguish an in-progress initialization
> +   from an interrupted initialization (in which case we need to reclaim the
> +   lock), we look at the fork generation that's part of the second state: We
> +   can reclaim iff it differs from the current fork generation.
> +   XXX: This algorithm has an ABA issue on the fork generation: If an
> +   initialization is interrupted, we then fork 2^30 times (30b of once_control
> +   are used for the fork generation), and try to initialize again, we can
> +   deadlock because we can't distinguish the in-progress and interrupted cases
> +   anymore.  */

Good comment. Good note on the ABA issue, even if somewhat impractical today.
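
The state encoding and the wrap-around behind the ABA note can be checked with plain arithmetic. The sketch below assumes a 32-bit once_control and a fork generation that advances in steps of 4, so that the two low bits stay free for the state flags; that matches the "30b" in the comment, but the step size and all helper names here are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical names for the states described in the comment above.  */
#define ONCE_UNINIT   0u   /* not yet initialized */
#define ONCE_INPROG   1u   /* bit 0: initialization in progress */
#define ONCE_DONE     2u   /* bit 1: initialization finished */
#define FORK_GEN_STEP 4u   /* assumed: generation lives in bits 2..31 */

/* Value stored in once_control while an initialization is in progress.  */
static uint32_t
in_progress_value (uint32_t fork_generation)
{
  return fork_generation | ONCE_INPROG;
}

/* Fork generation after N more forks, with 32-bit wrap-around.  */
static uint32_t
generation_after_forks (uint32_t fork_generation, uint64_t n)
{
  return (uint32_t) (fork_generation + n * FORK_GEN_STEP);
}

/* Mirrors the patch's test: an equal value means "same generation, some
   other thread is initializing, so wait".  */
static int
looks_in_progress (uint32_t oldval, uint32_t current_generation)
{
  return oldval == in_progress_value (current_generation);
}
```

With a stale value left behind by an interrupted initialization, looks_in_progress is false after one fork (the lock can be reclaimed) but true again after 2^30 forks, which is the case where a waiter would block forever.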

>  int
>  __pthread_once (once_control, init_routine)
>       pthread_once_t *once_control;
> @@ -42,15 +62,26 @@ __pthread_once (once_control, init_routine)
>      {
>        int oldval, val, newval;
>  
> +      /* We need acquire memory order for this load because if the value
> +         signals that initialization has finished, we need to see any
> +         data modifications done during initialization.  */
>        val = *once_control;
> +      atomic_read_barrier();
>        do
>   {
> -  /* Check if the initialized has already been done.  */
> -  if ((val & 2) != 0)
> +  /* Check if the initialization has already been done.  */
> +  if (__builtin_expect ((val & 2) != 0, 1))
>      return 0;
>  
>    oldval = val;
> -  newval = (oldval & 3) | __fork_generation | 1;
> +  /* We try to set the state to in-progress and having the current
> +     fork generation.  We don't need atomic accesses for the fork
> +     generation because it's immutable in a particular process, and
> +     forked child processes start with a single thread that modified
> +     the generation.  */
> +  newval = __fork_generation | 1;
> +  /* We need acquire memory order here for the same reason as for the
> +     load from once_control above.  */
>    val = atomic_compare_and_exchange_val_acq (once_control, newval,
>       oldval);
>   }
> @@ -59,9 +90,10 @@ __pthread_once (once_control, init_routine)
>        /* Check if another thread already runs the initializer. */
>        if ((oldval & 1) != 0)
>   {
> -  /* Check whether the initializer execution was interrupted
> -     by a fork. */
> -  if (((oldval ^ newval) & -4) == 0)
> +  /* Check whether the initializer execution was interrupted by a
> +     fork. (We know that for both values, bit 0 is set and bit 1 is
> +     not.)  */
> +  if (oldval == newval)
>      {
>        /* Same generation, some other thread was faster. Wait.  */
>        lll_futex_wait (once_control, newval, LLL_PRIVATE);
> @@ -79,8 +111,11 @@ __pthread_once (once_control, init_routine)
>        pthread_cleanup_pop (0);
>  
>  
> -      /* Add one to *once_control.  */
> -      atomic_increment (once_control);
> +      /* Mark *once_control as having finished the initialization.  We need
> +         release memory order here because we need to synchronize with other
> +         threads that want to use the initialized data.  */
> +      atomic_write_barrier();
> +      *once_control = 2;
>  
>        /* Wake up all other threads.  */
>        lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);

Cheers,
Carlos.


Re: [PATCH] Unify pthread_once (bug 15215)

Ondřej Bílka
On Thu, May 23, 2013 at 12:15:32AM -0400, Carlos O'Donell wrote:

> On 05/08/2013 10:43 AM, Torvald Riegel wrote:
> > Note that this will make a call to pthread_once that doesn't need to
> > actually run the init routine slightly slower due to the additional
> > acquire barrier.  If you're really concerned about this overhead, speak
> > up.  There are ways to avoid it, but it comes with additional complexity
> > and bookkeeping.
>
> We want correctness. This is a place where correctness is infinitely
> more important than speed. We should be correct first and then we
> should argue about how to make it fast.
>
As pthread_once calls tend to be made once per thread, performance is
not an issue.

> > If there's no objection to the essence of this patch, I'll post another
> > patch that actually replaces I1 and I2 with the modified variant in the
> > attached patch.
>
> Please repost.
>
We are waiting for the new version.

Re: [PATCH] Unify pthread_once (bug 15215)

Rich Felker
On Mon, Aug 26, 2013 at 02:49:55PM +0200, Ondřej Bílka wrote:

> On Thu, May 23, 2013 at 12:15:32AM -0400, Carlos O'Donell wrote:
> > On 05/08/2013 10:43 AM, Torvald Riegel wrote:
> > > Note that this will make a call to pthread_once that doesn't need to
> > > actually run the init routine slightly slower due to the additional
> > > acquire barrier.  If you're really concerned about this overhead, speak
> > > up.  There are ways to avoid it, but it comes with additional complexity
> > > and bookkeeping.
> >
> > We want correctness. This is a place where correctness is infinitely
> > more important than speed. We should be correct first and then we
> > should argue about how to make it fast.
> >
> As pthread_once calls tend to be called once per thread performance is
> not an issue.

No, pthread_once _calls_ tend to be once per access to an interface
that requires static data to have been initialized, so possibly very
often. On the other hand, pthread_once only invokes the init function
once per program instance. I don't see anything that would typically
happen once per thread, although I suppose you could optimize out
calls to pthread_once with tls:

    static __thread int once_done = 0;
    static pthread_once_t once = PTHREAD_ONCE_INIT;
    if (!once_done) {
        pthread_once(&once, init);
        once_done = 1;
    }

This requires work at the application level, though, and whether it's
a net advantage depends a lot on whether multiple threads are likely
to be hammering pthread_once on the same once object, and whether the
arch has expensive acquire barriers and inexpensive TLS access.
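
A fuller version of the fragment above, as a sketch only: the cfg_* names and the slow-call counter are made up for the example, and __thread is the GNU spelling of thread-local storage. The counter is there just to make visible how often the thread actually reaches pthread_once.

```c
#include <pthread.h>

static pthread_once_t cfg_once = PTHREAD_ONCE_INIT;
static int cfg_value;                /* the lazily initialized static data */
static int cfg_init_runs;            /* how often the init routine ran */
static __thread int cfg_once_done;   /* per-thread "already checked" flag */
static __thread int cfg_slow_calls;  /* per-thread count of real pthread_once calls */

static void
cfg_init (void)
{
  cfg_init_runs++;
  cfg_value = 42;
}

/* Accessor that needs cfg_value to be initialized.  Only the first call
   in each thread goes through pthread_once; later calls just test the
   thread-local flag.  */
static int
cfg_get (void)
{
  if (!cfg_once_done)
    {
      cfg_slow_calls++;
      pthread_once (&cfg_once, cfg_init);
      cfg_once_done = 1;
    }
  return cfg_value;
}
```

The flag is safe without any barrier because a thread only sets it after its own pthread_once call has returned, so everything it reads afterwards is ordered by that call.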

Rich

Re: [PATCH] Unify pthread_once (bug 15215)

Ondřej Bílka
On Mon, Aug 26, 2013 at 12:45:07PM -0400, Rich Felker wrote:

> On Mon, Aug 26, 2013 at 02:49:55PM +0200, Ondřej Bílka wrote:
> > On Thu, May 23, 2013 at 12:15:32AM -0400, Carlos O'Donell wrote:
> > > On 05/08/2013 10:43 AM, Torvald Riegel wrote:
> > > > Note that this will make a call to pthread_once that doesn't need to
> > > > actually run the init routine slightly slower due to the additional
> > > > acquire barrier.  If you're really concerned about this overhead, speak
> > > > up.  There are ways to avoid it, but it comes with additional complexity
> > > > and bookkeeping.
> > >
> > > We want correctness. This is a place where correctness is infinitely
> > > more important than speed. We should be correct first and then we
> > > should argue about how to make it fast.
> > >
> > As pthread_once calls tend to be called once per thread performance is
> > not an issue.
>
> No, pthread_once _calls_ tend to be once per access to an interface
> that requires static data to have been initialized, so possibly very
> often. On the other hand, pthread_once only invokes the init function
> once per program instance. I don't see anything that would typically
> happen once per thread, although I suppose you could optimize out
> calls to pthread_once with tls:
>
It could happen often, but does it? Given the need for locking, you need to
avoid it in performance-critical code. By "once per thread" I meant
patterns like:

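/* Assumes: pthread_once_t baz; pthread_t foo; pthread_attr_t bar; */
void computation(void)
{
  pthread_once(&baz, init); /* Do common initialization. */
  pthread_create(&foo, &bar, routine, NULL);
}

or

pthread_create(&foo, &bar, routine, NULL);

with

void *routine(void *arg)
{
  pthread_once(&baz, init); /* Do common initialization. */
  ...
  return NULL;
}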

>     static __thread int once_done = 0;
>     static pthread_once_t once;
>     if (!once_done) {
>         pthread_once(&once, init);
>         once_done = 1;
>     }
>
> This requires work at the application level, though, and whether it's
> a net advantage depends a lot on whether multiple threads are likely
> to be hammering pthread_once on the same once object, and whether the
> arch has expensive acquire barriers and inexpensive TLS access.
>
Actually, you can use the following if you are concerned about those use cases:

#define pthread_once2(x,y) ({   \
  static __thread int once = 0; \
  if (!once)                    \
    pthread_once(x,y);          \
  once=1;                       \
})

> Rich


Re: [PATCH] Unify pthread_once (bug 15215)

Rich Felker
On Mon, Aug 26, 2013 at 08:41:50PM +0200, Ondřej Bílka wrote:

> > No, pthread_once _calls_ tend to be once per access to an interface
> > that requires static data to have been initialized, so possibly very
> > often. On the other hand, pthread_once only invokes the init function
> > once per program instance. I don't see anything that would typically
> > happen once per thread, although I suppose you could optimize out
> > calls to pthread_once with tls:
> >
> It could happen often, but does it? Given the need for locking, you need to
> avoid it in performance-critical code. By "once per thread" I meant
> patterns like:
>
> computation(){
>   pthread_once(baz,init); // Do common initialization.
>   pthread_create(foo,bar,routine);
> }
>
> or
>
> pthread_create(foo,bar,routine);
>
> with
>
> routine()
>   {
>     pthread_once(baz,init); // Do common initialization.
>     ...
>   }

These patterns arise if the library is making threads and using
pthread_once to initialize its static data before making the thread.
I'm thinking instead of the case where your library is being _called_
by multi-threaded code, and using pthread_once to ensure that its data
is safely initialized even if there are multiple threads which might
be racing to be the first caller.
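
To make that case concrete, here is a small sketch of a library entry point that may be hit by several threads at once. The library (lib_square and its table), the thread count and the run_callers helper are all invented for the example.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 8   /* arbitrary for the demo */

static pthread_once_t lib_once = PTHREAD_ONCE_INIT;
static int lib_table[4];     /* static data the library needs */
static int lib_init_runs;    /* how often lib_init ran */

static void
lib_init (void)
{
  lib_init_runs++;
  for (int i = 0; i < 4; i++)
    lib_table[i] = i * i;
}

/* Entry point of the hypothetical library.  Every call goes through
   pthread_once, because the library cannot know who calls it first.  */
int
lib_square (int i)
{
  pthread_once (&lib_once, lib_init);
  return lib_table[i & 3];
}

static void *
caller (void *arg)
{
  int *result = arg;
  *result = lib_square (3);
  return NULL;
}

/* Starts NTHREADS racing callers; returns 0 if all could be created.  */
static int
run_callers (int results[NTHREADS])
{
  pthread_t tids[NTHREADS];
  int created = 0, rc = 0;

  for (; created < NTHREADS; created++)
    if (pthread_create (&tids[created], NULL, caller,
                        &results[created]) != 0)
      {
        rc = -1;
        break;
      }
  for (int i = 0; i < created; i++)
    pthread_join (tids[i], NULL);
  return rc;
}
```

Here the number of pthread_once calls grows with the number of calls into the library, not with the number of threads, which is why the cost of the already-initialized path matters.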

> >     static __thread int once_done = 0;
> >     static pthread_once_t once;
> >     if (!once_done) {
> >         pthread_once(&once, init);
> >         once_done = 1;
> >     }
> >
> > This requires work at the application level, though, and whether it's
> > a net advantage depends a lot on whether multiple threads are likely
> > to be hammering pthread_once on the same once object, and whether the
> > arch has expensive acquire barriers and inexpensive TLS access.
> >
> Actually, you can use the following if you are concerned about those use cases:
>
> #define pthread_once2(x,y) ({   \
>   static __thread int once = 0; \
>   if (!once)                    \
>     pthread_once(x,y);          \
>   once=1;                       \
> })

Indeed; actually, this could even be done in pthread.h, with some
slight variations, perhaps:

#define pthread_once(x,y) ({                  \
  pthread_once_t *__x = (x);                  \
  static __thread pthread_once_t *__once;     \
  if (__once != __x) {                        \
    pthread_once(__x,y);                      \
    __once = __x;                             \
  }                                           \
})
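
The same idea can be written as a function, which makes its behaviour easier to observe. This is only a sketch: once_cached and the counters are hypothetical, and unlike the macro (which gets one cache per expansion site) this version shares a single cached pointer per thread across all callers.

```c
#include <pthread.h>
#include <stddef.h>

static __thread pthread_once_t *last_once;  /* last once object this thread completed */
static __thread int slow_calls;             /* real pthread_once calls by this thread */

/* Calls pthread_once only if ONCE is not the object this thread saw
   last.  A single-entry cache: correct for any call pattern, but it
   only saves work when consecutive calls use the same once object.  */
static int
once_cached (pthread_once_t *once, void (*init) (void))
{
  if (last_once != once)
    {
      slow_calls++;
      int rc = pthread_once (once, init);
      if (rc != 0)
        return rc;
      last_once = once;
    }
  return 0;
}

/* Two independent once objects for the usage example.  */
static pthread_once_t once_a = PTHREAD_ONCE_INIT;
static pthread_once_t once_b = PTHREAD_ONCE_INIT;
static int a_runs, b_runs;
static void init_a (void) { a_runs++; }
static void init_b (void) { b_runs++; }
```

Repeated calls with once_a cost one real pthread_once call; switching to once_b and back costs one more each time, while each init routine still runs only once.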

Rich

Re: [PATCH] Unify pthread_once (bug 15215)

Torvald Riegel-4
In reply to this post by Carlos O'Donell-6
On Thu, 2013-05-23 at 00:15 -0400, Carlos O'Donell wrote:

> On 05/08/2013 10:43 AM, Torvald Riegel wrote:
> > See http://sourceware.org/bugzilla/show_bug.cgi?id=15215 for background.
>
> You've already hashed out the details of these changes with Rich
> and he has no objection with this first phase of the patch which
> is to unify the implementations.
>  
> > I1 and I2 follow essentially the same algorithm, and we can replace it
> > with a unified variant, as the bug suggests.  See the attached patch for
> > a modified version of the sparc instance.  The differences between both
> > are either cosmetic, or are unnecessary changes (ie, how the
> > init-finished state is set (atomic_inc vs. store), or how the fork
> > generations are compared).
> >
> > Both I1 and I2 were missing a release memory order (MO) when marking
> > once_control as finished initialization.  If the particular arch doesn't
> > need a HW barrier for release, we at least need a compiler barrier; if
> > it's needed, the original I1 and I2 are not guaranteed to work.
> >
> > Both I1 and I2 were missing acquire MO on the very first load of
> > once_control.  This needs to synchronize with the release MO on setting
> > the state to init-finished, so without it it's not guaranteed to work
> > either.
> > Note that this will make a call to pthread_once that doesn't need to
> > actually run the init routine slightly slower due to the additional
> > acquire barrier.  If you're really concerned about this overhead, speak
> > up.  There are ways to avoid it, but it comes with additional complexity
> > and bookkeeping.
>
> We want correctness. This is a place where correctness is infinitely
> more important than speed. We should be correct first and then we
> should argue about how to make it fast.
>
> > I'm currently also using the existing atomic_{read/write}_barrier
> > functions instead of not-yet-existing load_acq or store_rel functions.
> > I'm not sure whether the latter can have somewhat more efficient
> > implementations on Power and ARM; if so, and if you're concerned about
> > the overhead, we can add load_acq and store_rel to atomic.h and start
> > using it.  This would be in line with C11, where we should eventually be
> > heading to anyways, IMO.
>
> Agreed.
>
> > Both I1 and I2 have an ABA issue on __fork_generation, as explained in
> > the comments that the patch adds.  How do you all feel about this?
> > I can't present a simple fix right now, but I believe it could be fixed
> > with additional bookkeeping.
> >
> > If there's no objection to the essence of this patch, I'll post another
> > patch that actually replaces I1 and I2 with the modified variant in the
> > attached patch.
>
> Please repost.
See attached patch.  This has been tested on ppc64 but not on the other
archs that are affected.  Nonetheless, ppc has a weak memory model, so,
for example, having an acquire barrier on a load or not having it does
make a difference.

OK?

patch (31K) Download Attachment

Re: [PATCH] Unify pthread_once (bug 15215)

Torvald Riegel-4
On Sun, 2013-10-06 at 02:20 +0200, Torvald Riegel wrote:

> On Thu, 2013-05-23 at 00:15 -0400, Carlos O'Donell wrote:
> > On 05/08/2013 10:43 AM, Torvald Riegel wrote:
> > > See http://sourceware.org/bugzilla/show_bug.cgi?id=15215 for background.
> >
> > You've already hashed out the details of these changes with Rich
> > and he has no objection with this first phase of the patch which
> > is to unify the implementations.
> >  
> > > I1 and I2 follow essentially the same algorithm, and we can replace it
> > > with a unified variant, as the bug suggests.  See the attached patch for
> > > a modified version of the sparc instance.  The differences between both
> > > are either cosmetic, or are unnecessary changes (ie, how the
> > > init-finished state is set (atomic_inc vs. store), or how the fork
> > > generations are compared).
> > >
> > > Both I1 and I2 were missing a release memory order (MO) when marking
> > > once_control as finished initialization.  If the particular arch doesn't
> > > need a HW barrier for release, we at least need a compiler barrier; if
> > > it's needed, the original I1 and I2 are not guaranteed to work.
> > >
> > > Both I1 and I2 were missing acquire MO on the very first load of
> > > once_control.  This needs to synchronize with the release MO on setting
> > > the state to init-finished, so without it it's not guaranteed to work
> > > either.
> > > Note that this will make a call to pthread_once that doesn't need to
> > > actually run the init routine slightly slower due to the additional
> > > acquire barrier.  If you're really concerned about this overhead, speak
> > > up.  There are ways to avoid it, but it comes with additional complexity
> > > and bookkeeping.
> >
> > We want correctness. This is a place where correctness is infinitely
> > more important than speed. We should be correct first and then we
> > should argue about how to make it fast.
> >
> > > I'm currently also using the existing atomic_{read/write}_barrier
> > > functions instead of not-yet-existing load_acq or store_rel functions.
> > > I'm not sure whether the latter can have somewhat more efficient
> > > implementations on Power and ARM; if so, and if you're concerned about
> > > the overhead, we can add load_acq and store_rel to atomic.h and start
> > > using it.  This would be in line with C11, where we should eventually be
> > > heading to anyways, IMO.
> >
> > Agreed.
> >
> > > Both I1 and I2 have an ABA issue on __fork_generation, as explained in
> > > the comments that the patch adds.  How do you all feel about this?
> > > I can't present a simple fix right now, but I believe it could be fixed
> > > with additional bookkeeping.
> > >
> > > If there's no objection to the essence of this patch, I'll post another
> > > patch that actually replaces I1 and I2 with the modified variant in the
> > > attached patch.
> >
> > Please repost.
>
> See attached patch.  This has been tested on ppc64 but not on the other
> archs that are affected.  Nonetheless, ppc has a weak memory model, so,
> for example, having an acquire barrier on a load or not having it does
> make a difference.
>
> OK?
Attached is a slightly updated version; the only difference is that the
changelog chunks now incorporate earlier feedback that I had forgotten
about.

patch (30K) Download Attachment

Re: [PATCH] Unify pthread_once (bug 15215)

Joseph Myers
In reply to this post by Torvald Riegel-4
I have no comments on the substance of this patch, but note that ports/
has a separate ChangeLog file for each architecture.

--
Joseph S. Myers
[hidden email]

Re: [PATCH] Unify pthread_once (bug 15215)

Torvald Riegel-4
On Mon, 2013-10-07 at 16:04 +0000, Joseph S. Myers wrote:
> I have no comments on the substance of this patch, but note that ports/
> has a separate ChangeLog file for each architecture.

Sorry. The attached patch now has separate ChangeLog entries for each of
the affected archs.

patch (32K) Download Attachment

Re: [PATCH] Unify pthread_once (bug 15215)

Will Newton
On 7 October 2013 22:53, Torvald Riegel <[hidden email]> wrote:
> On Mon, 2013-10-07 at 16:04 +0000, Joseph S. Myers wrote:
>> I have no comments on the substance of this patch, but note that ports/
>> has a separate ChangeLog file for each architecture.
>
> Sorry. The attached patch now has separate ChangeLog entries for each of
> the affected archs.

There seems to be a significant performance delta on aarch64:

Old code:

"pthread_once": {
"": {
"duration": 9.29471e+09, "iterations": 1.10667e+09, "max": 24.54,
"min": 8.38, "mean": 8.39882

New code:

"pthread_once": {
"": {
"duration": 9.72366e+09, "iterations": 4.33843e+08, "max": 30.86,
"min": 22.38, "mean": 22.4128

And also ARM:

Old code:

"pthread_once": {
"": {
"duration": 8.38662e+09, "iterations": 6.6695e+08, "max": 35.292,
"min": 12.416, "mean": 12.5746

New code:

"pthread_once": {
"": {
"duration": 9.26424e+09, "iterations": 3.07574e+08, "max": 86.125,
"min": 28.875, "mean": 30.1204

It would be nice to understand the source of this variation. I can put
it on my todo list but I can't promise I will be able to look at it
any time soon.
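
For a quick sanity check of these figures, the sketch below recomputes the mean as duration / iterations and the old-to-new ratio of the means. The struct and function names are made up; the numbers are copied from the output above, and no unit is assumed.

```c
/* One benchmark record, with the fields reported above.  */
struct bench
{
  double duration;
  double iterations;
  double mean;
};

static const struct bench aarch64_old = { 9.29471e+09, 1.10667e+09, 8.39882 };
static const struct bench aarch64_new = { 9.72366e+09, 4.33843e+08, 22.4128 };
static const struct bench arm_old = { 8.38662e+09, 6.6695e+08, 12.5746 };
static const struct bench arm_new = { 9.26424e+09, 3.07574e+08, 30.1204 };

/* Mean time per call recomputed from the totals.  */
static double
mean_per_call (const struct bench *b)
{
  return b->duration / b->iterations;
}

/* How many times slower the new code is, based on the reported means.  */
static double
slowdown (const struct bench *old_code, const struct bench *new_code)
{
  return new_code->mean / old_code->mean;
}
```

The recomputed means agree with the reported ones, and the ratios come out at roughly 2.7x on aarch64 and 2.4x on ARM.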

--
Will Newton
Toolchain Working Group, Linaro