[ANN] Userspace M-on-N threading model implementation. Alpha release.

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

[ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov
Hello.

I'm pleased to announce initial userspace M-on-N threading model
implementation (for hackers) called NTL.
This is first alpha release, which indeed has bugs and limitations.

Userspace M-on-N threading model is based on the idea, that when signal
is delivered, kernel saves all information related to previous context
in stack, so it is possible to find it and replace.

M-on-N threading model compared to usual NPTL 1-on-1 model has following
advantages and disadvantages:

Benefits.

1. Fast scheduling.
There is no need to cross userspace/kernelspace boundary to schedule new
thread execution (just watch what happens with userspace network stack
compared to kernel's one when there are a lot of syscalls performed for
small packets receiving/sending).

2. Fast thread creation and destruction.
It just becomes an allocation of the structure in the userspace, no need
for full creation process which is performed in clone() syscall.

3. Smaller number of cache misses.
Since there is only one process instead of several threads, cache
locality is increased greatly with reduced number of misses.

Drawbacks.

1. Scheduling fairness.
Since kernel does not know about multiple threads behind given process,
it can not add it appropriate number of timeslices for execution.
Can be solved either by more tight collaboarion of the userspace and
kernelspace schedulers or simply by increasing process' nice value.

2. All communications are performed through one kevent pipe. (TODO)
All syscalls are going to be converted into non-blocking operations
(including nanosleep() and the like), and keep a track of what each
context performed. In practice glibc rewrite is not what I would like to
do, but instead some layer on top of it will be implemented, which will
convert syscalls into kevent operations, and become a rescheduling
point.

3. Complex code for good SMP scalability and userspace scheduler.
Not a problem. (TESTING)

SMP scalability in M-on-N threading model.

Since only kernel can schedule thread (actually not even thread or
process, but its own kernel's representation, so called kernel's virtual
process) to run on specified CPU, M-on-N threading model should have
several real threads (for example several current POSIX threads), its
number should be equal to number of real CPUs, and then library layer
will schedule execution of context of different real threads, each of
which in turn can run on separate CPU.

So, userspace will create new real threads when pthread_create() is
called until number of them is less than number of real CPUs, each real
thread in turn is a context in the global set of contexts, where fake
context will be added with all subsequent pthread_create() calls, and
userspace scheduler (backed by real threads) will pick up several
contexts from the tree and execute them on the real CPUs.

I would be possible to use existing Linux clone() syscall, but due to
complete absence of hte documentation (which is sometimes plain wrong)
and ery strong encryption of glibc sources it is quite complex task.

As NPTL, M-on-N threading library uses stack rlimit for thread stack
allocation.

Benchmarks.

I only ran simple benchmark of empty thread creation (its function just
exits).
After I started to use atomic locks ("lock" prefix on x86) instead of
semaphores, thread start/empty exec/stop was reduced down to 0.3
microseconds compared to 14 microsecods for POSIX NPTL case.

But there are problems.
First one is that I perform initial context setup through signal
invokation, which is at least two syscalls. They are slow.
Another one is that thread is really started only after rescheduling,
which is another signal, so another two syscalls.
Third on is that there must exist different locking primitives - for
signal context and for process context, which must block signals, which
in turn adds additional overhead of sigprocmask() syscall.

After I fixed all above issues (actually not fixed, but confirmed that
they must exist), performance reduced to 9 microseconds compared to 14
microsecods for POSIX NPTL case for empty thread creation/destruction.

(Test machine is Core Duo 2.4 Ghz (run at 3.7) with 2 GB of ram).

This can be fixed, if I would have created arch-specific
getcontext()-like calls, which would be mutually transformable into
signal context information (existing getcontext() and friends produces
different data than signal context has at least on x86). But I can not
right now, since I do not know enough x86 ABI (I learned a lot for past
several days, as you can notice from this blog, but it is still even
remotely not enough).

Currently M-on-N threading model uses ugly arch-specific hacks to start
new threads, which actually are something remotely similar to
makecontext().
So, the solution, which will rock M-on-N threading implementation is to
convert or create getcontext() and friends calls which can be used with
signal context information.

Another limitations are:
* x86 only (I do not have different test boxes to learn different asm)
* does not work if compiled with position-independent code support
* does not work if some functions are inlined (so -fno-inline flag)
* no support for run-time syscall substitution (to make it rescheduling
  point) yet
* looks like a real hack

Advantages:
* it is faster (noticebly faster)
* it is simpler
* code is not encrypted like in glibc sources

Sources and developement comments can be downloaded from NTL homepage at
http://tservice.net.ru/~s0mbre/old/?section=projects&item=threading

Archive is also attached for interested reader.

Thank you.

Signed-off-by: Evgeniy Polyakov <[hidden email]>

--
        Evgeniy Polyakov

threading-2007_01_29.tar.gz (12K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov
P.S. I'm not subscribed to any of the above lists, please Cc: me in
replies.


--
        Evgeniy Polyakov
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Chris Friesen
In reply to this post by Evgeniy Polyakov
Evgeniy Polyakov wrote:
> Hello.
>
> I'm pleased to announce initial userspace M-on-N threading model
> implementation (for hackers) called NTL.

If you haven't already, I suggest you look into the story of NGPT and
also read the NPTL white paper
(http://people.redhat.com/drepper/nptl-design.pdf) especially section
5.1 describing why they went with a 1:1 model.

Chris
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Samuel Thibault-2
In reply to this post by Evgeniy Polyakov
Hi,

Evgenity, le Mon 29 Jan 2007 16:47:36 +0100, a écrit :
> Userspace M-on-N threading model is based on the idea, that when signal
> is delivered, kernel saves all information related to previous context
> in stack, so it is possible to find it and replace.

You may want to have a look at some existing implementations:

- Good old `FSU Pthreads' http://moss.csc.ncsu.edu/~mueller/pthreads/
- fully POSIX-compliant `GnuPth' http://www.gnu.org/software/pth/
- server-targetted `Capriccio'
www.cs.berkeley.edu/~jcondit/capriccio-sosp-2003.pdf
- efficient `ELiTE/Erlangen'
http://www4.informatik.uni-erlangen.de/Projects/FORTWIHR/ELiTE/
- and our portable, flexible, efficient `Marcel'
http://runtime.futurs.inria.fr/marcel/

Samuel
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov
In reply to this post by Chris Friesen
On Mon, Jan 29, 2007 at 10:40:42AM -0600, Chris Friesen ([hidden email]) wrote:

> Evgeniy Polyakov wrote:
> >Hello.
> >
> >I'm pleased to announce initial userspace M-on-N threading model
> >implementation (for hackers) called NTL.
>
> If you haven't already, I suggest you look into the story of NGPT and
> also read the NPTL white paper
> (http://people.redhat.com/drepper/nptl-design.pdf) especially section
> 5.1 describing why they went with a 1:1 model.

Of course I read this, but it does not change anything.

NGPT had 2 times slower start time than NPTL, and NTL has 2-20 times
faster than NPTL, so I think NGPT had too major problems to get it
into comparison.

I described in details why and how M:N model better, and its drawbacks
include all issues mentioned by Ulrich Drepper, but nevertheless its
advantages are far too superiour than those which can be provided by 1:1
model.

> Chris

--
        Evgeniy Polyakov
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov
In reply to this post by Samuel Thibault-2
On Tue, Jan 30, 2007 at 02:18:17AM +0100, Samuel Thibault ([hidden email]) wrote:
> Hi,
>
> Evgenity, le Mon 29 Jan 2007 16:47:36 +0100, a écrit :
> > Userspace M-on-N threading model is based on the idea, that when signal
> > is delivered, kernel saves all information related to previous context
> > in stack, so it is possible to find it and replace.
>
> You may want to have a look at some existing implementations:

I saw most of them.
As far as I recall, only PTL (is not shown here) has preemptible
scheduler. NTL has it too, but is based on different approach.

> - Good old `FSU Pthreads' http://moss.csc.ncsu.edu/~mueller/pthreads/
> - fully POSIX-compliant `GnuPth' http://www.gnu.org/software/pth/
> - server-targetted `Capriccio'
> www.cs.berkeley.edu/~jcondit/capriccio-sosp-2003.pdf
> - efficient `ELiTE/Erlangen'
> http://www4.informatik.uni-erlangen.de/Projects/FORTWIHR/ELiTE/
> - and our portable, flexible, efficient `Marcel'
> http://runtime.futurs.inria.fr/marcel/
>
> Samuel

--
        Evgeniy Polyakov
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Samuel Thibault
Evgeniy Polyakov, le Tue 30 Jan 2007 12:53:16 +0300, a écrit :
> > You may want to have a look at some existing implementations:
>
> I saw most of them.
> As far as I recall, only PTL (is not shown here) has preemptible
> scheduler. NTL has it too, but is based on different approach.

Marcel has a preemptible scheduler too, based on signals and
setjmp/longjmp.

Samuel
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov
On Tue, Jan 30, 2007 at 11:24:51AM +0100, Samuel Thibault ([hidden email]) wrote:
> Evgeniy Polyakov, le Tue 30 Jan 2007 12:53:16 +0300, a écrit :
> > > You may want to have a look at some existing implementations:
> >
> > I saw most of them.
> > As far as I recall, only PTL (is not shown here) has preemptible
> > scheduler. NTL has it too, but is based on different approach.
>
> Marcel has a preemptible scheduler too, based on signals and
> setjmp/longjmp.

Do some documentation and benchmarks exist for that library - site seems
to only described environment is was created for? How does blocking
problem solved?

> Samuel

--
        Evgeniy Polyakov
Reply | Threaded
Open this post in threaded view
|

RE: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Kaz Kylheku-3
In reply to this post by Evgeniy Polyakov
Evgeniy Polyakov wrote:
> I described in details why and how M:N model better, and its drawbacks
> include all issues mentioned by Ulrich Drepper, but nevertheless its
> advantages are far too superiour than those which can be
> provided by 1:1
> model.

M:N threading is an unnecessary performance hack that's needed by people
who are living in a C or C++ exile away from some language that has
lexical closures, generators or first-class continuations. Not having
these niceties, they resort to emulating them with threads. The proper
thing to do is to rewrite the code to use state machines which can be
driven by any available thread. Or else, write yourself a
source-to-source transformer that will give C the lexical closure,
generator, or continuation features that you need to express the
solution that way.

There is no need to retain any vestiges of a user space threading
implementation when you have the real thing.

Programs which appear to benefit from that model are badly optimized or
badly designed. A smartly written program uses an available thread to do
as much work as possible, until that thread happens to block or its time
slice burns up.
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov
On Tue, Jan 30, 2007 at 01:16:22PM -0800, Kaz Kylheku ([hidden email]) wrote:

> Evgeniy Polyakov wrote:
> > I described in details why and how M:N model better, and its drawbacks
> > include all issues mentioned by Ulrich Drepper, but nevertheless its
> > advantages are far too superiour than those which can be
> > provided by 1:1
> > model.
>
> M:N threading is an unnecessary performance hack that's needed by people
> who are living in a C or C++ exile away from some language that has
> lexical closures, generators or first-class continuations. Not having
> these niceties, they resort to emulating them with threads. The proper
> thing to do is to rewrite the code to use state machines which can be
> driven by any available thread. Or else, write yourself a
> source-to-source transformer that will give C the lexical closure,
> generator, or continuation features that you need to express the
> solution that way.
>
> There is no need to retain any vestiges of a user space threading
> implementation when you have the real thing.
>
> Programs which appear to benefit from that model are badly optimized or
> badly designed. A smartly written program uses an available thread to do
> as much work as possible, until that thread happens to block or its time
> slice burns up.

Do not mix languages like Erlang, specialy designed for concurrent
programming, with M:N threading model - they are completely different,
but you do not want to see this. As you pointed, one thread can do as
much as it need until it is blocked, and what next? Allocate new real
thread? You may want to see how things like JVM work, I seriously doubt
spwning new thread each time task blocks is a way to go. Even having
epoll does not help in many cases. And you forgot the price of
rescheduling in kernelspace and userspace - even with signals it differs
two times, with more intellegent case it differs in 20 times!
Virtual machine can have thousands of threads, actually it cant, since
it will kill Linux in rescheduling.

--
        Evgeniy Polyakov
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Lee Revell
In reply to this post by Evgeniy Polyakov
On 1/29/07, Evgeniy Polyakov <[hidden email]> wrote:
> 1. Scheduling fairness.
> Since kernel does not know about multiple threads behind given process,
> it can not add it appropriate number of timeslices for execution.
> Can be solved either by more tight collaboarion of the userspace and
> kernelspace schedulers or simply by increasing process' nice value.

nice value is only meaningful for SCHED_OTHER.  How will you handle a
multithreaded realtime application that uses SCHED_OTHER as well as
SCHED_FIFO threads?

Lee
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov
On Wed, Jan 31, 2007 at 11:28:07PM -0500, Lee Revell ([hidden email]) wrote:

> On 1/29/07, Evgeniy Polyakov <[hidden email]> wrote:
> >1. Scheduling fairness.
> >Since kernel does not know about multiple threads behind given process,
> >it can not add it appropriate number of timeslices for execution.
> >Can be solved either by more tight collaboarion of the userspace and
> >kernelspace schedulers or simply by increasing process' nice value.
>
> nice value is only meaningful for SCHED_OTHER.  How will you handle a
> multithreaded realtime application that uses SCHED_OTHER as well as
> SCHED_FIFO threads?

Threads created inside one process obviously can not compete with RT
threads created by other process. Instead process, which threads have RT
priority itself should change its priority to RT to compete with other
RT process. (By RT I mean any cases except SCHED_OTHER which is
default).

> Lee

--
        Evgeniy Polyakov
Reply | Threaded
Open this post in threaded view
|

Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Bill Davidsen
In reply to this post by Kaz Kylheku-3
Kaz Kylheku wrote:

> Evgeniy Polyakov wrote:
>> I described in details why and how M:N model better, and its drawbacks
>> include all issues mentioned by Ulrich Drepper, but nevertheless its
>> advantages are far too superiour than those which can be
>> provided by 1:1
>> model.
>
> M:N threading is an unnecessary performance hack that's needed by people
> who are living in a C or C++ exile away from some language that has
> lexical closures, generators or first-class continuations.

Yes, that's called the "real world." Arguments of the "I don't need it,
in a perfect world you wouldn't either, therefore it's a bad idea" type
simply contribute nothing.

Because user threading can avoid context switches, there will always be
cases where it will outperform o/s threads for hardware reasons.

--
Bill Davidsen <[hidden email]>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot