FOR REVIEW: New x86-64 vsyscall vgetcpu()


Andi Kleen

I got several requests over the years to provide a fast way to get
the current CPU and node on x86-64.  That is useful for a couple of things:

- The kernel gets a lot of benefit from using per CPU data to get better
cache locality and avoid cache line bouncing. This is currently
not quite possible for user programs. With a fast way to know the current
CPU, user space can use per CPU data that is likely already in cache.
Locking is still needed of course - after all the thread might switch
to a different CPU - but at least the memory should already be in cache,
and locking on cached memory is much cheaper.

- For NUMA optimization in user space you really need to know the current
node to find out where to allocate memory from.
If you allocate a fresh page from the kernel the kernel will give you
one in the current node, but if you keep your own pools like most programs
do you need to know this to select the right pool.
For single threaded programs it is usually not a big issue because they
tend to start on one node, allocate all their memory there and eventually
use it there too. But for multithreaded programs, where threads can
run on different nodes, it is a bigger problem to make sure the threads
can get node-local memory for best performance.

At first glance such a call still looks like a bad idea - after all the kernel can
switch a process to other CPUs at any time, so any result of this call might
be wrong as soon as it returns.

But at a closer look it really makes sense:
- The kernel has strong thread affinity and usually keeps a process on the
same CPU. So switching CPUs is rare. This makes it a useful optimization.

The alternative is usually to bind the process to a specific CPU - then it
"know" where it is - but the problem is that this is nasty to use and
requires user configuration. The kernel often can make better decisions on
where to schedule. And doing it automatically makes it just work.

This cannot be done effectively in user space because it requires
translating local APIC numbers to Linux CPU numbers, and only the
kernel knows how to do that.

Doing it in a syscall is too slow so doing it in a vsyscall makes sense.

I have patches now in my tree from Vojtech
ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/getcpu-vsyscall
(note doesn't apply on its own, needs earlier patches in the quilt set)

The prototype is

long vgetcpu(int *cpu, int *node, unsigned long *tcache)

cpu gets the current CPU number if not NULL.
node gets the current node number if not NULL.
tcache is a pointer to a two-element long array; it can also be NULL. It is described below.
The return value is always 0.

[I modified the prototype a bit over Vojtech's original implementation
to be more foolproof and add the caching mechanism]
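
For illustration, a minimal user-space sketch could look like this - the
fixed vsyscall address is my assumption based on the existing x86-64
vsyscall slot layout, and glibc would normally hide it behind a wrapper:

#include <stdio.h>

/* Prototype from above; the address is an assumed vsyscall slot. */
typedef long (*vgetcpu_t)(int *cpu, int *node, unsigned long *tcache);

int main(void)
{
        vgetcpu_t vgetcpu = (vgetcpu_t) 0xffffffffff600800UL;
        int cpu, node;

        vgetcpu(&cpu, &node, 0);        /* tcache may be NULL */
        printf("cpu %d node %d\n", cpu, node);
        return 0;
}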

Unfortunately all ways to get this information from the CPU are relatively slow:
the vsyscall uses RDTSCP on CPUs that support it and CPUID(1) otherwise,
and both are relatively slow.

They stall the pipeline and add some overhead,
so I added a special caching mechanism. The idea is that if the call is a little
slow then user space would likely cache the information anyway. The problem
with caching is that you need a way to find out if it's out of date. User space
cannot do this on its own because it doesn't have a fast way to access a time stamp.

But the x86-64 vsyscall implementation happens to have one - vgettimeofday()
already has access to jiffies, which can be used as a timestamp to
invalidate the cache. The vsyscall cannot cache this information by itself
though - it doesn't have any storage. The idea is that the user passes in a
TLS variable which is then used for storage.  With that the information
can be at most a jiffie out of date, which is good enough.
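
A sketch of the intended caller pattern (the wrapper declaration and the
pool helpers below are made-up names, not part of the patch):

extern long vgetcpu(int *cpu, int *node, unsigned long *tcache);

/* Opaque per-thread cache: the vsyscall keeps a jiffies stamp in
   here and revalidates the cached result by itself. */
static __thread unsigned long tcache[2];

void *alloc_near_me(unsigned long size)
{
        int cpu;

        vgetcpu(&cpu, 0, tcache);       /* fast path once cached */
        /* per_cpu_pool[] and pool_alloc() are hypothetical. */
        return pool_alloc(per_cpu_pool[cpu], size);
}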

The contents of the cache are theoretically supposed to be opaque (although I'm
sure user programs will soon abuse that because it will be such a convenient way
to get at jiffies ..). I've considered XORing it with a constant to make clear
it's not, but that is probably overkill (?). It might still be safer, because
jiffies is unsafe to use in user space - the unit might change.

The array is slightly ugly - one open possibility is to replace it with
a structure. That shouldn't make much difference to the general semantics of the call though.

Some numbers:  (the getpid is to compare syscall cost)

AMD RevF (with RDTSCP support):
getpid 162 cycles
vgetcpu 145 cycles
vgetcpu rdtscp 32 cycles
vgetcpu cached 14 cycles

Intel Pentium-D (Smithfield):
getpid 719 cycles
vgetcpu 535 cycles
vgetcpu cached 27 cycles

AMD RevE:
getpid 162 cycles
vgetcpu 185 cycles
vgetcpu cached 15 cycles

As you can see CPUID(1) is always very slow, but usually still narrowly wins
against the syscall, except on the AMD E stepping. The difference
is very small there, and while it would have been possible to implement
a third mode for this that uses a real syscall, I ended up not doing so because it
has some other implications.

With the caching mechanism it really flies though and should be fast enough
for most uses.

My eventual hope is that glibc will start using this to implement a NUMA-aware
malloc() in user space that preferentially allocates local memory.
I would say that's the biggest gap we still have in "general purpose" NUMA tuning
on Linux. Of course it will likely be useful for a lot of other scalable
code too.

Comments on the general mechanism are welcome. If someone is interested in using
this in user space for SMP or NUMA tuning please let me know.

I haven't quite made up my mind yet whether it's 2.6.18 material or not.

-Andi

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Alan Cox
On Wed, 2006-06-14 at 09:42 +0200, Andi Kleen wrote:
> Comments on the general mechanism are welcome. If someone is interested in using
> this in user space for SMP or NUMA tuning please let me know.

Will 2 words always be enough? It costs nothing to demand 8 or 16 ...


Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Wolfram Gloger
In reply to this post by Andi Kleen
Hi,

> - For NUMA optimization in user space you really need to know the current
> node to find out where to allocate memory from.
...
> My eventual hope is that glibc will start using this to implement a NUMA-aware
> malloc() in user space that preferentially allocates local memory.

I'm interested in working on this once the syscall(s) are in place.

Would one need to record the node where an mmap() was performed, and
then use that as a hint when several alternative areas are available
to fulfill a malloc() request?

Hmm, in that case: how about a new /dev/nodemem which would behave
like /dev/zero but supply the node and cache information for the VMA
in a short block at the start of the mapped region (this would be
guaranteed to be correct, as opposed to what a vgetcpu() would have
given before or after the mmap())?

For ptmalloc2 (current in glibc) such a NUMA optimization should be
rather straightforward, and current ptmalloc3 (www.malloc.de) has made
such an optimization only slightly more difficult, AFAICS.

Regards,
Wolfram.

Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Andi Kleen
On Wednesday 14 June 2006 14:46, Wolfram Gloger wrote:
> Hi,
>
> > - For NUMA optimization in user space you really need to know the current
> > node to find out where to allocate memory from.
> ...
> > My eventual hope is that glibc will start using this to implement a NUMA-aware
> > malloc() in user space that preferentially allocates local memory.
>
> I'm interested in working on this once the syscall(s) are in place.

Great.

I'll let you know when it is in.

>
> Would one need to record the node where an mmap() was performed, and
> then use that as a hint when several alternative areas are available
> to fulfill a malloc() request?

Yep.

I would just note on which CPU you ran when you did mmap - the kernel
will usually give you "local" memory and that should be good enough.

You will have to save that information somewhere so that you can
return it to the right pool on free (it can be gotten from the kernel,
but that will probably be too slow).

Or the information can be stored indirectly, by looking up which
pool the address belongs to.
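
A sketch of that bookkeeping (error handling omitted; the arena
structure and the wrapper declaration are of course up to the
allocator, not something the kernel defines):

#include <sys/mman.h>

extern long vgetcpu(int *cpu, int *node, unsigned long *tcache);

struct arena {
        int node;                       /* recorded right after mmap() */
        /* ... free lists etc. ... */
};

struct arena *arena_new(unsigned long size)
{
        struct arena *a = mmap(0, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        /* The kernel normally hands out node-local pages, so the
           node we run on now is a good guess for the arena's home. */
        vgetcpu(0, &a->node, 0);
        return a;
}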


> Hmm, in that case: how about a new /dev/nodemem which would behave
> like /dev/zero but supply the node

That can already be gotten from get_mempolicy(). But I wouldn't recommend
using it. Or at least not in the fast path - it's useful for debugging.

> and cache information

Cache only depends on which CPU used it last.

> for the VMA  
> in a short block at the start of the mapped region (this would be
> guaranteed to be correct, as opposed to what a vgetcpu() would have
> given before or after the mmap())?
>
> For ptmalloc2 (current in glibc) such a NUMA optimization should be
> rather straightforward, and current ptmalloc3 (www.malloc.de) has made
> such an optimization only slightly more difficult, AFAICS.

Nice. I'm sure if you could come up with a prototype it would be
possible to help you benchmark/tune it.

Thanks,

-Andi
 

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Ulrich Drepper
In reply to this post by Andi Kleen
In principle this is a good development.  Ingo and I contemplated the
same for some time.  But there are issues which should be solved.

1.  x86-64 should be converted to the regular vDSO format we use for the
other archs.  These magic addresses are simply unacceptable.  Adding yet
another one only makes things worse.  Outside glibc nobody should use
them, so it's just one dependency.  We can control via setarch whether
the old code is available for an interim period.

> long vgetcpu(int *cpu, int *node, unsigned long *tcache)

Do you expect the value returned in *cpu and *node to require an error
value?  If not, then why this fascination with signed types?

And as for the cache: you definitely should use a length parameter.
We've seen in the past over and over again that implicit length
requirements sooner or later fail.

Maybe this is a more commonly needed requirement for the vdso in
future.  In which case it might be adequate to have a new vdso
function which returns the total number of bytes needed for caches etc.
This could be called while setting up the thread stack.  Then we pass a
pointer to an appropriately sized local memory region to this call and
others.  The vdso functions will organize among themselves which part of
the buffer each of them can use.  Internally it would be a struct,
initially with only the two longs you currently have in mind.

Not doing it this way would mean that for each new vdso function needing
TLS memory we would have to modify the very low-level TLS structures.
That's a horrible proposition.


So, I suggest adding

long vgettlsreq(void);


which would be implemented using something like this


struct vdso_tls {
  unsigned long getcpu_cache[2];
};

long vgettlsreq(void) { return sizeof (struct vdso_tls); }


and the beginning of vgetcpu would look like this


long vgetcpu(int *cpu, int *node, void *tlsptr) {
  unsigned long *tcache = ((struct vdso_tls *) tlsptr)->getcpu_cache;
  ...
}
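
The thread-setup side of this proposal would then be roughly as follows
(allocate_thread_local() stands in for whatever the low-level TLS setup
code actually uses; both functions are part of the proposal, not
existing interfaces):

/* Once, while setting up the thread: */
long tlssz = vgettlsreq();
void *vdso_tls = allocate_thread_local(tlssz);

/* Afterwards every vdso call that needs scratch space gets the
   same block: */
int cpu, node;
vgetcpu(&cpu, &node, vdso_tls);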

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Andi Kleen
On Wednesday 14 June 2006 17:23, Ulrich Drepper wrote:

> 1.  x86-64 should be converted to the regular vDSO format we use for the
> other archs.  These magic addresses are simply unacceptable.  Adding yet
> another one makes things only worse.  Outside glibc nobody should use
> them so it's just one dependency.  We can control via setarch whether
> the old code is available for an interim period.

I'm not going to break the ABI on this. There are non-glibc libcs,
programs that don't use glibc, and statically linked programs.

Eventually we'll need a dynamic format but I'll only add it
for new calls that actually require it for security.
vgetcpu doesn't need it.

>
> > long vgetcpu(int *cpu, int *node, unsigned long *tcache)
>
> Do you expect the value returned in *cpu and *node to require an error
> value?  If not, then why this fascination with signed types?

Shouldn't make a difference.

> And as for the cache: you definitely should use a length parameter.
> We've seen in the past over and over again that implicit length
> requirements sooner or later fail.

No, the cache should be completely opaque to user space. It's just
temporary space for the vsyscall which it cannot store for itself.
I'll probably change it to a struct to make that clearer.

length doesn't make sense for that use.

 
> Maybe this is a more commonly needed requirement for the vdso in
> future.  In which case it might be adequate to have a new vdso
> function which returns the total number of bytes needed for caches etc.
> This could be called while setting up the thread stack.  Then we pass a
> pointer to an appropriately sized local memory region to this call and
> others.  The vdso functions will organize among themselves which part of
> the buffer each of them can use.  Internally it would be a struct,
> initially with only the two longs you currently have in mind.

If some other function needs a cache too it can define its own.
I don't see any advantage of using a shared buffer.

>
> Not doing it this way would mean that for each new vdso function needing
> TLS memory we would have to modify the very low-level TLS structures.
> That's a horrible proposition.

I think you're misunderstanding the concept. The vsyscall doesn't
know anything about TLS. It just gets a pointer to a fixed-size
buffer. How that buffer is provided is left to the caller.
glibc would likely use TLS, but the vsyscall code doesn't care.


-Andi

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Ulrich Drepper
Andi Kleen wrote:

> Eventually we'll need a dynamic format but I'll only add it
> for new calls that actually require it for security.
> vgetcpu doesn't need it.

Just introduce the vdso now, add all new vdso calls there.  There is no
reason except laziness to continue with these moronic fixed addresses.
They only get in the way of address space layout change/optimizations.
And nobody said anything about breaking apps which use the fixed
addresses.  That code can still be available.  One should be able to
turn it off with setarch.


>>> long vgetcpu(int *cpu, int *node, unsigned long *tcache)
>> Do you expect the value returned in *cpu and *node to require an error
>> value?  If not, then why this fascination with signed types?
>
> Shouldn't make a difference.

If there is no reason for a signed type none should be used.  It can
only lead to problems.

This reminds me: what are the values for the CPU number?  Are they
contiguous?  Are they the same as those used in the affinity syscalls
(they had better be)?  With hotplug CPUs, are CPU numbers "recycled"?


>> And as for the cache: you definitely should use a length parameter.
>> We've seen in the past over and over again that implicit length
>> requirements sooner or later fail.
>
> No, the cache should be completely opaque to user space. It's just
> temporary space for the vsyscall which it cannot store for itself.
> I'll probably change it to a struct to make that clearer.
>
> length doesn't make sense for that use.

You didn't even try to understand what I said.  Yes, in this one case
you might at this point in time only need two words.  But

- this might change
- there might be other future functions in the vdso which need memory.
  It is a huge pain to provide more and more of these individual
  variables.  Better allocate one chunk.


> If some other function needs a cache too it can define its own.
> I don't see any advantage of using a shared buffer.

I believe that _you_ don't see it.  Because the pain is in the libc.
The code to set up stack frames has to be adjusted for each new TLS
variable.  It is better to do it once in a general way, which is what I
suggested.


> I think you're misunderstanding the concept.

No, I understand perfectly.  You don't get it because you don't want to
understand the userlevel side.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Andi Kleen
On Wednesday 14 June 2006 18:30, Ulrich Drepper wrote:
> > Eventually we'll need a dynamic format but I'll only add it
> > for new calls that actually require it for security.
> > vgetcpu doesn't need it.
>
> Just introduce the vdso now, add all new vdso calls there.  There is no
> reason except laziness to continue with these moronic fixed addresses.
> They only get in the way of address space layout change/optimizations.

The user address space size on x86-64 is final (barring the architecture getting
extended beyond 48-bit VA). We already use all of the positive
space. But the vsyscalls don't even live in user address space.

> >>> long vgetcpu(int *cpu, int *node, unsigned long *tcache)
> >> Do you expect the value returned in *cpu and *node to require an error
> >> value?  If not, then why this fascination with signed types?
> >
> > Shouldn't make a difference.
>
> If there is no reason for a signed type none should be used.  It can
> only lead to problems.

OK, I can change it to unsigned if you feel that strongly about it.

>
> This reminds me: what are the values for the CPU number?  Are they
> continuous?  Are they the same as those used in the affinity syscalls
> (they better be)?  

Yes of course.

> With hotplug CPUs, are CPU numbers "recycled"?

I think if the same CPU gets unplugged and replugged it should
get the same number. Otherwise new numbers should be allocated.


> Yes, in this one case
> you might at this point in time only need two words.  But
>
> - this might change

Alan suggested adding some padding, which probably
makes sense, although I frankly don't see the implementation
changing.  Variable length would be clear overkill and I refuse
to overdesign this.

> - there might be other future functions in the vdso which need memory.
>   It is a huge pain to provide more and more of these individual
>   variables.  Better allocate one chunk.

Why is it a problem? It's just a __thread variable, isn't it?
 
>
> > If some other function needs a cache too it can define its own.
> > I don't see any advantage of using a shared buffer.
>
> I believe that _you_ don't see it.  Because the pain is in the libc.
> The code to set up stack frames has to be adjusted for each new TLS
> variable.  It is better to do it once in a general way, which is what I
> suggested.

Hmm, I thought user space could define arbitrary __thread variables of its own.
I certainly used that in some of my code. Why is it a problem for the libc to do the same?

Anyway, even if it's such a big problem you can put it all in
one chunk and partition it yourself given the fixed size. I don't think
the kernel code should concern itself about this.

-Andi

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Luck, Tony
In reply to this post by Andi Kleen
On 6/14/06, Andi Kleen <[hidden email]> wrote:
> But at a closer look it really makes sense:
> - The kernel has strong thread affinity and usually keeps a process on the
> same CPU. So switching CPUs is rare. This makes it a useful optimization.

Alternatively it means that this will almost always do the right thing, but
once in a while it won't, your application will happen to have been migrated
to a different cpu/node at the point it makes the call, and from then on
this instance will behave oddly (running slowly because it allocates most
of its memory on the wrong node).  When you try to reproduce the problem,
the application will work normally.

> The alternative is usually to bind the process to a specific CPU - then it
> "know" where it is - but the problem is that this is nasty to use and
> requires user configuration. The kernel often can make better decisions on
> where to schedule. And doing it automatically makes it just work.

Another alternative would be to provide a mechanism for a process
to bind to the current cpu (whatever cpu that happens to be).  Then
the kernel gets to make the smart placement decisions, and processes
that want to be bound somewhere (but don't really care exactly where)
have a way to meet their need.  Perhaps a cpumask of all zeroes to a
sched_setaffinity call could be overloaded for this?
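
Under that proposed overload (not current behavior - today the kernel
simply rejects an empty mask with EINVAL) it might look like:

#define _GNU_SOURCE
#include <sched.h>

void bind_to_current_cpu(void)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        /* Proposed semantics: an all-zero mask would mean "pin me
           to whichever CPU I happen to be running on right now". */
        sched_setaffinity(0, sizeof(mask), &mask);
}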

Or we can dig up some of the old virtual cpu/virtual node suggestions (we
will eventually need to do something like this, but most systems now don't
have enough cpus for this to make much sense yet).

-Tony

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Andi Kleen
On Thursday 15 June 2006 20:44, Tony Luck wrote:

> On 6/14/06, Andi Kleen <[hidden email]> wrote:
> > But at a closer look it really makes sense:
> > - The kernel has strong thread affinity and usually keeps a process on the
> > same CPU. So switching CPUs is rare. This makes it a useful optimization.
>
> Alternatively it means that this will almost always do the right thing, but
> once in a while it won't, your application will happen to have been migrated
> to a different cpu/node at the point it makes the call, and from then on
> this instance will behave oddly (running slowly because it allocates most
> of its memory on the wrong node).  When you try to reproduce the problem,
> the application will work normally.

That's inherent in NUMA. No good way around that.

We have a similar problem with caches because we don't color them. People
have learned to live with it.
 

> > The alternative is usually to bind the process to a specific CPU - then it
> > "know" where it is - but the problem is that this is nasty to use and
> > requires user configuration. The kernel often can make better decisions on
> > where to schedule. And doing it automatically makes it just work.
>
> Another alternative would be to provide a mechanism for a process
> to bind to the current cpu (whatever cpu that happens to be).  Then
> the kernel gets to make the smart placement decisions, and processes
> that want to be bound somewhere (but don't really care exactly where)
> have a way to meet their need.  Perhaps a cpumask of all zeroes to a
> sched_setaffinity call could be overloaded for this?

I tried something like this a few years ago and it just didn't work
(or rather usually ran slower). The scheduler would select a home node at startup and
then try to move the process there.

The problem is that not using a CPU costs you much more than whatever
overhead you get from using non local memory.

So by default filling the CPUs must be the highest priority and memory
policy cannot interfere with that.
 
-Andi

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Gerd Hoffmann
Andi Kleen wrote:
>> Alternatively it means that this will almost always do the right thing, but
>> once in a while it won't, your application will happen to have been migrated
>> to a different cpu/node at the point it makes the call, and from then on
>> this instance will behave oddly (running slowly because it allocates most
>> of its memory on the wrong node).  When you try to reproduce the problem,
>> the application will work normally.
>
> That's inherent in NUMA. No good way around that.

Hmm, maybe it makes sense to allow binding memory areas to threads
instead of nodes.  That way the kernel may attempt to migrate the pages
to another node in case it migrates threads / processes.  Either via
mbind(), or maybe better via madvise() to make clear it's a hint only.
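
Interface-wise, something like this (buf/len being the area in
question; the flag is purely hypothetical):

#include <sys/mman.h>

/* MADV_FOLLOW_THREAD is a made-up flag - the idea: pages are
   migrated along with the thread that uses them if the scheduler
   moves it to another node. */
madvise(buf, len, MADV_FOLLOW_THREAD);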

cheers,

  Gerd

--
Gerd Hoffmann <[hidden email]>
http://www.suse.de/~kraxel/julika-dora.jpeg

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Andi Kleen
On Friday 16 June 2006 09:23, Gerd Hoffmann wrote:

> Andi Kleen wrote:
> >> Alternatively it means that this will almost always do the right thing, but
> >> once in a while it won't, your application will happen to have been migrated
> >> to a different cpu/node at the point it makes the call, and from then on
> >> this instance will behave oddly (running slowly because it allocates most
> >> of its memory on the wrong node).  When you try to reproduce the problem,
> >> the application will work normally.
> >
> > That's inherent in NUMA. No good way around that.
>
> Hmm, maybe it makes sense to allow binding memory areas to threads
> instead of nodes.  That way the kernel may attempt to migrate the pages
> to another node in case it migrates threads / processes.  Either via
> mbind(), or maybe better via madvise() to make clear it's a hint only.

I haven't tried that but I have talked to others who tried to implement
automatic page migration and they say they couldn't make that work (or rather
make it a win) either.

-Andi

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Jes Sorensen-3
In reply to this post by Andi Kleen
>>>>> "Andi" == Andi Kleen <[hidden email]> writes:

Andi> On Thursday 15 June 2006 20:44, Tony Luck wrote:
>> Another alternative would be to provide a mechanism for a process
>> to bind to the current cpu (whatever cpu that happens to be).  Then
>> the kernel gets to make the smart placement decisions, and
>> processes that want to be bound somewhere (but don't really care
>> exactly where) have a way to meet their need.  Perhaps a cpumask of
>> all zeroes to a sched_setaffinity call could be overloaded for
>> this?

Andi> I tried something like this a few years ago and it just didn't
Andi> work (or rather usually ran slower). The scheduler would select a
Andi> home node at startup and then try to move the process there.

Andi> The problem is that not using a CPU costs you much more than
Andi> whatever overhead you get from using non local memory.

It all depends on your application and the type of system you are
running on. What you say applies to smaller cpu counts. However, once
the upcoming larger-count multi-core cpus become commonly
available, this is likely to change and become more like what is seen
today on larger NUMA systems.

In the scientific application space, there are two very common
groupings of jobs. One is simply a large threaded application with a
lot of intercommunication, often via MPI. In many cases one ends up
running a job on just a subset of the system, in which case you want
to see threads placed on the same node(s) to minimize internode
communication. It is desirable to force the other tasks on the
system (system daemons etc.) onto other node(s) to reduce noise, and
there could also be space to run another parallel job on the remaining
node(s).

The other common case is to have jobs which spawn off a number of
threads that work together in groups (via OpenMP). In this case you
would like to have all your OpenMP threads placed on the same node for
similar reasons.

Not getting this right can result in significant loss of performance
for jobs which are highly memory bound or rely heavily on
intercommunication and synchronization.

Andi> So by default filling the CPUs must be the highest priority and
Andi> memory policy cannot interfere with that.

I really don't think this approach is going to solve the problem. As
Tony also points out, tasks will eventually migrate. The user needs to
tell the kernel where it wants to run the tasks rather than the kernel
telling the task where it is located. Only the application (or
developer/user) knows how the threads are expected to behave, doing
this automatically is almost never going to be optimal. Obviously the
user needs visibility of the topology of the machine to do so but that
should be available on any NUMA system through /proc or /sys.

In the scientific space the jobs are often run repeatedly with new
data sets every time, so it is worthwhile to spend the effort up front
to get the placement right. One-off runs are obviously something else
and there your method is going to be more beneficial.

IMHO, what we really need is a more advanced way for user applications
to hint to the kernel how to place its threads.

Cheers,
Jes

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Andi Kleen

> It all depends on your application and the type of system you are
> running on. What you say applies to smaller cpu counts. However, once
> the upcoming larger-count multi-core cpus become commonly
> available, this is likely to change and become more like what is seen
> today on larger NUMA systems.

Maybe. Maybe not.

>
> In the scientific application space, there are two very common
> groupings of jobs.

The scientific users just use pinned CPUs and seem to be happy with that.
They also have cheap slav^wgrade students to spend lots of time on
manual tuning.  I'm not concerned about them.

If you already use CPU affinity you should already know where you are and don't
need this call at all.

So this clearly isn't targeted at them.

What is interesting is getting the best performance from general purpose applications
without any special tuning. For them I'm trying to improve things.

Number one applications currently are databases and JVMs. I hope that with
Wolfram's malloc work it will be useful for more applications too.

> Andi> So by default filling the CPUs must be the highest priority and
> Andi> memory policy cannot interfere with that.
>
> I really don't think this approach is going to solve the problem. As
> Tony also points out, tasks will eventually migrate.

Currently we don't solve this problem with the standard heuristics.
It can be solved with manual tuning (mempolicy, explicit CPU affinity)
but if you're doing that you're already outside the primary use
case of vgetcpu().

vgetcpu() is only trying to be an incremental improvement of the current
simple default local policy.

> The user needs to

Scientific users do that, but other users normally do not. I doubt that
is going to change.

-Andi

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Jes Sorensen-3
Andi Kleen wrote:
>> In the scientific application space, there are two very common
>> groupings of jobs.
>
> The scientific users just use pinned CPUs and seem to be happy with that.
> They also have cheap slav^wgrade students to spend lots of time on
> manual tuning.  I'm not concerned about them.

Do they? There are a lot of scientific sites out there which are not
universities or research organizations. They do not have free slave
labour at hand. A lot of users fall into this category, especially the
users with larger systems or large clusters (be it ia64, x86_64 or PPC).

> If you already use CPU affinity you should already know where you are and don't
> need this call at all.

Except that what's currently available isn't sufficient to do what is
needed.

> So this clearly isn't targeted at them.
>
> What is interesting is getting the best performance from general purpose applications
> without any special tuning. For them I'm trying to improve things.

Well, I am interested in getting the best performance for some of the
same applications, without having to modify them. The current affinity
support simply isn't sufficient for that. Placement has to be targeted
at launch time since thread implementations can change the layout etc.

> Number one applications currently are databases and JVMs. I hope that with
> Wolfram's malloc work it will be useful for more applications too.

If you want this to work for general purpose applications, then how is
this new syscall going to help? If you expect application vendors to
code for it, that means few users will benefit.

>> I really don't think this approach is going to solve the problem. As
>> Tony also points out, tasks will eventually migrate.
>
> Currently we don't solve this problem with the standard heuristics.
> It can be solved with manual tuning (mempolicy, explicit CPU affinity)
> but if you're doing that you're already outside the primary use
> case of vgetcpu().

This is another area where the kernel could do better by possibly using
the cpumask to determine where it will allocate memory.

> vgetcpu() is only trying to be an incremental improvement of the current
> simple default local policy.

As Tony rightfully pointed out, tasks do migrate. By making this guess
initially and then expecting the application to run for a long time,
you will end up with it having zero or possibly a negative effect.

>> The user needs to
>
> Scientific users do that, but other users normally do not. I doubt that
> is going to change.

I just use scientific users since that's where I have the most recent
detailed data from. Databases could well benefit from what I mentioned,
though the serious ones would want to look into using affinity support
explicitly in their code.

Cheers,
Jes

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Andi Kleen

> The current affinity
> support simply isn't sufficient for that. Placement has to be targeted
> at launch time since thread implementations can change the layout etc.

I'm not sure how that's related to vgetcpu, but ok ...

In general if you want to affect placement below the process / shared memory
segment level you should change the application.

Anything else just results in a big, messy, unreliable and fragile user
command line interface - a quick look at the respective Irix manpage should
make that clear.
 
> > Number one applications currently are databases and JVMs. I hope that with
> > Wolfram's malloc work it will be useful for more applications too.
>
> If you want this to work for general purpose applications, then how is
> this new syscall going to help?

It will improve their malloc(). They don't know anything about NUMA,
but getting local memory will help them. They already get local
memory now from the kernel when they use big allocations, but
for smaller allocations it doesn't work because the kernel can't
give out anything smaller than a page. This would be solved
by a NUMA aware malloc, but it needs vgetcpu() for this if it
should work without fixed CPU affinity.

Basically it is just about extending the existing, already used and proven
default local policy to sub pages. Also there might be other uses
for it too (like per CPU data), although I expect most use of that
in user space can already be done using TLS.

JVM and databases will use it too, but since they often use their
own allocators they will need to be modified.

> If you expect application vendors to
> code for it, that means few users will benefit.

Most applications use malloc().
 

> >> I really don't think this approach is going to solve the problem. As
> >> Tony also points out, tasks will eventually migrate.
> >
> > Currently we don't solve this problem with the standard heuristics.
> > It can be solved with manual tuning (mempolicy, explicit CPU affinity)
> > but if you're doing that you're already out side the primary use
> > case of vgetcpu().
>
> This is another area where the kernel could do better by possibly using
> the cpumask to determine where it will allocate memory.

Modify fallback lists based on cpu affinity?

That would get messy in the code because you couldn't easily precompute
them anymore.


But cpusets already does this kind of thing, even though it has quite a
bad impact on fast paths.
Also, what happens if the affinity mask is modified later?
From a semantics point of view it is also a little dubious to mesh
them together. My feeling is that as a heuristic it is probably
dubious.

Also, when you set cpu affinity you can just as well set memory
policy with it.


>
> > vgetcpu() is only trying to be a incremental improvement of the current
> > simple default local policy.
>
> As Tony rightfully pointed out, tasks do migrate. By making this guess
> initially

The gamble is already there in the local policy. No change at all.
When you already got local memory you can use it better with
vgetcpu() though.

From our experience it works out in most cases though - in general
most benchmarks show better performance with simple local NUMA
policy than SMP mode or no policy.

In the cases where it doesn't, you have to either eat the slowdown
or use manual tuning.

> I just use scientific users since thats where I have the most recent
> detailed data from. Databases could well benefit from what I mentioned,
> though the serious ones would want to look into using affinity support
> explicitly in their code.

No, exactly not - I got requests from "serious" databases to offer
vgetcpu() because affinity is too complicated to configure and manage.

It sounds like you want to solve NUMA world hunger here, not
concentrate on the specific small incremental improvement vgetcpu is trying
to offer.

I'm sure there is much research that could be done in the general NUMA
tuning area, but I would suggest making it research with numbers first,
before trying to hack anything like this into the kernel without
a clear understanding.

-Andi


Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Jes Sorensen-3
Andi Kleen wrote:
>> The current affinity
>> support simply isn't sufficient for that. Placement has to be targeted
>> at launch time since thread implementations can change the layout etc.
>
> I'm not sure how that's related to vgetcpu, but ok ...
>
> In general if you want to affect placement below the process / shared memory
> segment level you should change the application.

That would be great, except that a lot of these applications are
'standard' applications which people don't write themselves.
Sometimes the source code is no longer available. We could argue that
people should just rewrite their applications, but in reality this isn't
what's happening.

> It will improve their malloc(). They don't know anything about NUMA,
> but getting local memory will help them. They already get local
> memory now from the kernel when they use big allocations, but
> for smaller allocations it doesn't work because the kernel can't
> give out anything smaller than a page. This would be solved
> by a NUMA aware malloc, but it needs vgetcpu() for this if it
> should work without fixed CPU affinity.

I really don't see the benefit here. malloc already gets pages handed
down from the kernel which are node local, due to them being assigned on
a first-touch basis. I am not sure about glibc's malloc internals, but
rather than rely on a vgetcpu() call, all it really needs to do is keep
a thread-local pool which will automatically get its memory locally
through first-touch usage.

I don't see how a new syscall is going to provide anything to malloc
that it doesn't already have. What am I missing?

> Basically it is just about extending the existing, already used and proven
> default local policy to sub pages. Also there might be other uses
> for it too (like per CPU data), although I expect most use of that
> in user space can already be done using TLS.

The thread libraries already have their own thread local area which
should be allocated on the thread's own node if done right, which I
assume it is.

> JVM and databases will use it too, but since they often use their
> own allocators they will need to be modified.

I would assume the real databases to be smart enough to benefit from
things being first touch already. JVMs .... well who knows, can't say
I have a lot of faith in anything running in a JVM :)

>> If you expect application vendors to
>> code for it, that means few users will benefit.
>
> Most applications use malloc().

Which doesn't need the vgetcpu() call as far as I can see.

>> This is another area where the kernel could do better by possibly using
>> the cpumask to determine where it will allocate memory.
>
> Modify fallback lists based on cpu affinity?

It's a hint, not guaranteed placement. You have the same problem if you
try to allocate memory on a node and there's nothing left there.

> But cpusets already does this kind of thing, even though it has quite a
> bad impact on fast paths.
> Also, what happens if the affinity mask is modified later?
> From a semantics point of view it is also a little dubious to mesh
> them together. My feeling is that as a heuristic it is probably
> dubious.

If you migrate your app elsewhere, you should migrate the pages with it,
or not expect things to run with the local effect.

> The gamble is already there in the local policy. No change at all.
> When you already got local memory you can use it better with
> vgetcpu() though.
>
> From our experience it works out in most cases though - in general
> most benchmarks show better performance with simple local NUMA
> policy than SMP mode or no policy.

Could you share some information about the type of benchmarks?

>> I just use scientific users since that's where I have the most recent
>> detailed data from. Databases could well benefit from what I mentioned,
>> though the serious ones would want to look into using affinity support
>> explicitly in their code.
>
> No, exactly not - I got requests from "serious" databases to offer
> vgetcpu() because affinity is too complicated to configure and manage.
>
> It sounds like you want to solve NUMA world hunger here, not
> concentrate on the specific small incremental improvement vgetcpu is trying
> to offer.

I don't really see the point in solving something half way when it can
be done better. Maybe the "serious" databases should open up and let us
know what the problem is they are hitting.

> I'm sure there is much research that could be done in the general NUMA
> tuning area, but I would suggest making it research with numbers first,
> before trying to hack anything like this into the kernel without
> a clear understanding.

Well, I did spend a good chunk of time looking at some of this some time
ago and did speak a lot to one of my colleagues who actually runs
benchmarks using some of these tools to understand the impact. If
anything, it seems that vgetcpu is the issue that is still in the
research stage.

Cheers,
Jes

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Wolfram Gloger
Hi,

> I really don't see the benefit here. malloc already gets pages handed
> down from the kernel which are node local, due to them being assigned on
> a first-touch basis.

That's only true for the very first allocation!  malloc is all about
"recycling" memory from freed previous allocations.

> I am not sure about glibc's malloc internals, but
> rather than rely on a vgetcpu() call, all it really needs to do is keep
> a thread-local pool which will automatically get its memory locally
> through first-touch usage.

When a malloc() call is serviced, there are often several
possibilities from where to fulfill the request.
ptmalloc (currently in glibc) has several pools (arenas)
but currently cannot distinguish between them in any way, except that
of course it prefers the last used one.

> I don't see how a new syscall is going to provide anything to malloc
> that it doesn't already have. What am I missing?
 
With vgetcpu() one could easily record the node that allocated the
arena _and_ use that for later malloc() calls in case there is freed
memory available within an arena.
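
A rough sketch of that allocation path - arena_for_node() and the
chunk helpers are made-up names, and the wrapper declaration assumes
the prototype from the original post:

extern long vgetcpu(int *cpu, int *node, unsigned long *tcache);

static __thread unsigned long tcache[2];

void *numa_malloc(unsigned long size)
{
        int node;

        vgetcpu(0, &node, tcache);
        /* Prefer an arena that was created on the current node and
           has freed memory available. */
        struct arena *a = arena_for_node(node);
        if (a && arena_has_free(a, size))
                return arena_alloc(a, size);
        return malloc_fallback(size);   /* e.g. a fresh mmap() */
}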

Regards,
Wolfram.

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Zoltan Menyhart
In reply to this post by Jes Sorensen-3
Just to make sure I understand it correctly...
Assuming I have allocated per CPU data (numa control, etc.) pointed at by:

        void *per_cpu[MAXCPUS];

Assuming a per CPU variable has got an "offset" in each per CPU data area.
Accessing this variable can be done as follows:

        err = vgetcpu(&my_cpu, ...);
        if (err)
                goto ....
        pointer = (typeof(pointer)) (per_cpu[my_cpu] + offset);
        // use "pointer"...

It is a hundred times longer than "__get_per_cpu(var)++".

As we do not know when we can be moved to another CPU,
"vgetcpu()" has to be called again after a "reasonable short" time.

My idea is to map the current task structure at an arch. dependent
virtual address into the user space (obviously in RO).

        #define current ((struct task_struct *) 0x...)

No more need for "vgetcpu()" at all. The example above becomes:

        pointer = (typeof(pointer)) (per_cpu[current->thread_info.cpu] + offset);
        // use "pointer"...

As obtaining "pointer" does not cost much, it can be re-calculated at
each use => no problem knowing when to recheck it, and there is less chance
of using a neighbor's data.

Regards,

Zoltan Menyhart

Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()

Jes Sorensen-3
Zoltan Menyhart wrote:
> Just to make sure I understand it correctly...
> Assuming I have allocated per CPU data (numa control, etc.) pointed at by:

I think you misunderstood - vgetcpu is for userland usage, not within
the kernel.

Cheers,
Jes