Re: Speeding up the dynamic linker with 100s of DSOs?

michael meeks
Hi Andrew,

        So - we have the same problem in OO.o - but we have 'only' 150 DSOs
loaded (almost all dlopened) instead of 300.

        I have several patches addressing this posted to the binutils list -
particularly implementing -Bdirect linking (which stores the
compile-link-time information as to which library a symbol is defined
in) so that we can look there 1st at run-time. That, of course, removes
75% or so of the O(num-libraries) lookups from the link[1].

        Of course - a facile implementation of -Bdirect would have serious
problems with C++ - because of the vague linkage / ODR issues - so, the
patch detects which symbols are vague during library compile/link, then
propagates that information to other libraries, so life is good.
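
A toy model of the direct-binding idea described above, written as a sketch rather than as glibc or binutils code. Every name in it (global_lookup, direct_lookup, the "direct" and "vague" fields, the library and symbol names) is hypothetical. It only shows why a link-time hint removes the O(num-libraries) walk for ordinary symbols, and why vague symbols still have to go through the global search:

```python
# Toy model of symbol lookup; not glibc code. All names are hypothetical.

def global_lookup(symbol, scope):
    """Classic ELF search: probe each DSO in load order, first definition wins."""
    probes = 0
    for dso in scope:
        probes += 1
        if symbol in dso["defs"]:
            return dso["name"], probes
    return None, probes


def direct_lookup(symbol, referrer, scope, by_name):
    """Try the DSO recorded at link time first, unless the symbol is vague."""
    hint = referrer["direct"].get(symbol)
    if hint is not None and symbol not in referrer["vague"]:
        target = by_name[hint]
        if symbol in target["defs"]:
            return target["name"], 1
    # Vague (or unhinted) symbols still take the full search, so that
    # every DSO ends up agreeing on a single definition.
    return global_lookup(symbol, scope)


# 150 libraries in load order; only two of them define anything here.
scope = [{"name": f"lib{i}.so", "defs": set(), "direct": {}, "vague": set()}
         for i in range(150)]
scope[0]["defs"].add("typeinfo_for_Foo")
scope[120]["defs"].update({"do_work", "typeinfo_for_Foo"})
by_name = {dso["name"]: dso for dso in scope}

app = {"name": "app", "defs": set(),
       "direct": {"do_work": "lib120.so", "typeinfo_for_Foo": "lib120.so"},
       "vague": {"typeinfo_for_Foo"}}

print(global_lookup("do_work", scope))                         # ('lib120.so', 121)
print(direct_lookup("do_work", app, scope, by_name))           # ('lib120.so', 1)
print(direct_lookup("typeinfo_for_Foo", app, scope, by_name))  # ('lib0.so', 1)
```

The last call shows the C++ case: the hint points at lib120.so, but because the symbol is marked vague the first definition in load order wins, as it would without -Bdirect.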

        In addition, by sorting relocations & dynsym & dynstr entries one can
get another ~25% win, and by storing & comparing pre-computed hash
values one can get a ~50% win - so, all in all if we could get this
stuff into general glibcs[2] ;-), performance of large, modularized C++
apps would get a lot better. [ there are 2 final, more marginal tweaks
queued pending this getting rationalized too :-].
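
To illustrate the pre-computed hash idea above, here is a small sketch. The elf_hash function is the standard System V ELF symbol hash; the find helper, the chain layout and the use_hashvals switch are assumptions made for illustration and do not reflect the actual layout used by the .hashvals patch. The point is only that an integer comparison can reject most chain entries before any string comparison happens:

```python
def elf_hash(name: bytes) -> int:
    """Standard System V ELF symbol hash, kept to 32 bits."""
    h = 0
    for c in name:
        h = ((h << 4) + c) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF
    return h


def find(name, chain, use_hashvals):
    """Walk one hash chain; return (index, number_of_string_compares)."""
    want = elf_hash(name)
    compares = 0
    for idx, (sym, stored_hash) in enumerate(chain):
        if use_hashvals and stored_hash != want:
            continue  # cheap integer test rejects the entry
        compares += 1
        if sym == name:
            return idx, compares
    return None, compares


# Hypothetical chain: each entry carries its name and a stored hash value.
chain = [(s, elf_hash(s)) for s in (b"malloc", b"free", b"printf")]

print(hex(elf_hash(b"printf")))      # 0x77905a6
print(find(b"printf", chain, False)) # (2, 3)
print(find(b"printf", chain, True))  # (2, 1)
```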

        Check:

-Bdirect: http://sourceware.org/ml/binutils/2005-11/msg00380.html
sorting: http://sourceware.org/ml/binutils/2006-01/msg00024.html
hashvals: http://sourceware.org/ml/binutils/2006-01/msg00171.html

        The latest patches .suse.ized are at:

        http://go-oo.org/ooo-build/patches/test/*suse*.diff

        As one blogs about the work, I've had the Troll-tech guys & a random
ISV play with various features - suffering the same problems. I'd love
to know what your experience of testing these is - that is if you are ok
with re-building your glibc :-)

        HTH,

                Michael.

[1] - the other 25% of relocations in my case being vague type info.
[2] - SL 10.1 should ship with some of this, though it's not there yet.
--
 [hidden email]  <><, Pseudo Engineer, itinerant idiot


Re: Speeding up the dynamic linker with 100s of DSOs?

Andrew Chatham
What I've ended up doing is this:

After the call to _dl_map_object_deps(), when all of the required
objects have been loaded, I build up a table. For every symbol, I
record the earliest position in the global scope where that symbol
occurs. I just store the minimum in table[hash(symbol_name)] rather
than having a real hash table with chaining.

Then when a symbol is being looked up in the global scope, I look it
up in my hash table and skip to that object. So it's O(#symbols)
instead of O(#symbols * #DSOs), speeding up a few of my horrible test
cases by 95%. Even at 5% of the previous time, they're still slow. So
I may try your patches at some point, though I'm having enough trouble
getting this through at the moment.
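
The table described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the patch itself: the table size, the use of zlib.crc32 as the hash, and the names build_table and lookup are all made up for illustration. Each slot keeps only the minimum scope position of any symbol hashing to it, so a collision can make the search start too early but never too late:

```python
import zlib


def slot(symbol, size):
    """Deterministic stand-in hash; the real patch's hash is not known here."""
    return zlib.crc32(symbol.encode()) % size


def build_table(scope, size):
    """table[h] = earliest scope position defining any symbol hashing to h."""
    table = [len(scope)] * size  # len(scope) means "defined nowhere"
    for pos, defs in enumerate(scope):
        for sym in defs:
            s = slot(sym, size)
            if pos < table[s]:
                table[s] = pos
    return table


def lookup(symbol, scope, table):
    """Skip straight to the recorded position, then search as usual."""
    start = table[slot(symbol, len(table))]
    probes = 0
    for pos in range(start, len(scope)):
        probes += 1
        if symbol in scope[pos]:
            return pos, probes
    return None, probes


# 300 DSOs in the global scope; one symbol lives deep in the list.
scope = [set() for _ in range(300)]
scope[250].add("deep_symbol")
table = build_table(scope, 1 << 16)
print(lookup("deep_symbol", scope, table))  # (250, 1)

# Force every symbol into one slot to show collisions are only a slowdown.
scope[10].add("early_symbol")
tiny = build_table(scope, 1)
print(lookup("deep_symbol", scope, tiny))   # (250, 241)
```

Storing a bare minimum instead of chained entries keeps the table small and the build simple, at the cost of some wasted probes when unrelated symbols share a slot.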

I just left dlopened objects alone, since we don't use dlopen much. So
their symbols get handled the slow way, but it wouldn't be too hard to
handle them in the same way.

Still waiting on the copyright assignment, so the patch still has to
wait. Also, it looks terrible :)

Andrew


Re: Speeding up the dynamic linker with 100s of DSOs?

michael meeks
Hi Andrew,

On Thu, 2006-01-26 at 14:43 -0800, Andrew Chatham wrote:
> After the call to _dl_map_object_deps(), when all of the required
> objects have been loaded, I build up a table. For every symbol, I
> record the earliest position in the global scope where that symbol
> occurs. I just store the minimum in table[hash(symbol_name)] rather
> than having a real hash table with chaining.

        Ah; ok :-) so - this would burn a chunk of memory for OO.o; we have
~700k individual .dynsym entries kicking around. Of course, with my
-Bdirect approach, we know which symbols are 'vague' and have to be
looked up globally [ ie. ~none for C, and ~25% of my C++ ], and build a
separate hash just for them.

>  Even at 5% of the previous time, they're still slow. So
> I may try your patches at some point, though I'm having enough trouble
> getting this through at the moment.

        Sure - so, you really need to re-compile glibc with -Wl,-hashvals, but
*not* -Wl,-Bdirect; and compile libstdc++ with -Wl,-Bdirect to see a
non-trivial win.

        Interestingly glibc itself (wrt. -lpthread) is the -only- valid/useful
instance of interposing that my 'finterpose' tool detects on my system [ only
comparing all DSOs so far ]. OTOH - I've caught tens of bugs & broken
symbol exports by looking for other examples across the system [ most of
them reported ]. ELF interposing looks to me like a difficult-to-justify
nightmare - used deliberately in only 1 instance, but used in
error in many others, causing unexpected behavior.
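
How 'finterpose' works is not described in this thread, so the following is only a guess at the kind of check such a tool might perform: collect the exported symbols of each DSO and report any symbol defined by more than one of them. The function name find_interposed and the export lists are hypothetical sample data, not real library contents:

```python
from collections import defaultdict


def find_interposed(exports):
    """Map each symbol defined by more than one DSO to its definers, in order."""
    definers = defaultdict(list)
    for dso, symbols in exports.items():
        for sym in symbols:
            definers[sym].append(dso)
    return {sym: dsos for sym, dsos in definers.items() if len(dsos) > 1}


# Hypothetical export lists, in load order.
exports = {
    "libpthread.so.0": {"pthread_create", "write"},
    "libc.so.6": {"write", "printf"},
    "libfoo.so": {"foo_init", "helper"},
    "libbar.so": {"bar_init", "helper"},
}

for sym, dsos in sorted(find_interposed(exports).items()):
    print(sym, "->", ", ".join(dsos))
```

In this sample the duplicate "write" stands in for the deliberate libpthread case, while the duplicate "helper" is the sort of accidental export clash that would be worth reporting as a bug.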

        HTH,

                Michael.

--
 [hidden email]  <><, Pseudo Engineer, itinerant idiot


Re: Speeding up the dynamic linker with 100s of DSOs?

Nick Alcock-2
On Fri, 27 Jan 2006, michael meeks spake:

> On Thu, 2006-01-26 at 14:43 -0800, Andrew Chatham wrote:
>>  Even at 5% of the previous time, they're still slow. So
>> I may try your patches at some point, though I'm having enough trouble
>> getting this through at the moment.
>
> Sure - so, you really need to re-compile glibc with -Wl,-hashvals, but
> *not* -Wl,-Bdirect; and compile libstdc++ with -Wl,-Bdirect to see a
> non-trivial win.
>
> Interestingly glibc itself (wrt. -lpthread) is the -only- valid/useful
> instance of interposing my 'finterpose' tool detects on my system [ only
> comparing all DSOs so far ].

Interposing via LD_PRELOAD is more common. dmalloc could work that way
(but doesn't).

>                  ELF's/Interposing looks to me like a difficult to
> justify nightmare - used deliberately only in 1 instance, but used in
> error in many others causing unexpected behavior.

Symbol interposition is a lovely *idea* if you have a library with a
strict ABI and want to alter its behaviour from other libraries.

However those other libraries had better be closely tied to the first
library, or the first library had better be designed for interposition
(as glibc is), or the interposition will just confuse the first library,
or at least gain brittle dependencies on exactly what the functions you
are interposing happen to do; and if the first library is designed for
it, it can generally use an application-level fix like callbacks or
something of that nature. And it has horrible consequences for
performance and definite negative consequences for reliability, as you
note.

I think perhaps interposition should be something like PT_GNU_STACK,
turnable-on by ELF binaries that explicitly declare that they need it
(and when it's turned on it obviously `contaminates' the whole process);
I'd expect the vast majority of binaries to not declare anything of the
kind. (Obviously, using LD_PRELOAD would also enable interposition.
LD_PRELOAD is definitely the instance of interposition that would have
people screaming if we took it away, given how many nifty debugging
hacks and diagnostic tricks it enables.)

But alas that's not how ELF works...

--
`I won't make a secret of the fact that your statement/question
 sent a wave of shock and horror through us.' --- David Anderson

Re: Speeding up the dynamic linker with 100s of DSOs?

michael meeks

On Fri, 2006-01-27 at 12:11 +0000, Nix wrote:
> Interposing via LD_PRELOAD is more common. dmalloc could work that way
> (but doesn't).

        Sure.

> Symbol interposition is a lovely *idea* if you have a library with a
> strict ABI and want to alter its behaviour from other libraries.

        Sure - as an idea it's certainly elegant.

>  And it has horrible consequences for performance and definite negative
> consequences for reliability, as you note.

        Quite.

> I think perhaps interposition should be something like PT_GNU_STACK,
> turnable-on by ELF binaries that explicitly declare that they need it
> (and when it's turned on it obviously `contaminates' the whole process);

        Well - of course, my -Bdirect implementation provides a way to do
direct (non-interposing) linkage, in a way that honors linking with a
mix of old & new style libraries, and handles the tricky C++ vague
linkage issues - so it's nearly what you ask for. Clearly though you
check for DT_DIRECT rather than the absence of something else. Of course
on Solaris the (somewhat different) -Bdirect support is there as
standard & widely used.

> (Obviously, using LD_PRELOAD would also enable interposition.

        Sure - as soon as you LD_PRELOAD we turn all -Bdirect stuff off - of
course, it'd be trivial to optimize that, but would require some
re-factoring, that'd look unpleasant in patch form.
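
The fallback described above can be modelled in a few lines. This is a sketch of the policy only, with hypothetical names throughout (resolve, direct_hint, libtrace.so); it is not how the patch is structured. Any preloaded object switches every lookup back to the ordinary first-match search, with the preloads at the front of the scope:

```python
def resolve(symbol, direct_hint, scope, preloaded=()):
    """Return the name of the DSO that ends up providing `symbol`.

    scope and preloaded are lists of (dso_name, set_of_defined_symbols).
    """
    if preloaded:
        # Any preload disables direct binding entirely: plain first-match
        # search, preloaded objects first.
        for name, defs in list(preloaded) + list(scope):
            if symbol in defs:
                return name
        return None
    if direct_hint is not None:
        for name, defs in scope:
            if name == direct_hint and symbol in defs:
                return name
    # No usable hint: ordinary global search.
    for name, defs in scope:
        if symbol in defs:
            return name
    return None


scope = [("libapp.so", {"run"}), ("libc.so.6", {"malloc", "free"})]
preload = [("libtrace.so", {"malloc"})]

print(resolve("malloc", "libc.so.6", scope))           # libc.so.6
print(resolve("malloc", "libc.so.6", scope, preload))  # libtrace.so
print(resolve("free", "libc.so.6", scope, preload))    # libc.so.6
```

A finer-grained variant would keep direct binding for symbols the preloaded objects do not define, which is presumably the optimization being deferred.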

> LD_PRELOAD is definitely the instance of interposition that would have
> people screaming if we took it away, given how many nifty debugging
> hacks and diagnostic tricks it enables.)

        Of course /me uses it himself all the time.

        HTH,

                Michael.

--
 [hidden email]  <><, Pseudo Engineer, itinerant idiot


Re: Speeding up the dynamic linker with 100s of DSOs?

Nick Alcock-2
On Fri, 27 Jan 2006, michael meeks stipulated:

>
> On Fri, 2006-01-27 at 12:11 +0000, Nix wrote:
>> I think perhaps interposition should be something like PT_GNU_STACK,
>> turnable-on by ELF binaries that explicitly declare that they need it
>> (and when it's turned on it obviously `contaminates' the whole process);
>
> Well - of course, my -Bdirect implementation provides a way to do
> direct (non-interposing) linkage, in a way that honors linking with a
> mix of old & new style libraries, and handles the tricky C++ vague
> linkage issues - so it's nearly what you ask for.

Yes; it seems like a very nifty piece of work to me.

>                                                   Clearly though you
> check for DT_DIRECT rather than the absence of something else. Of course

Absolutely! This isn't like the read-only stack stuff where you had to
assume that the stack was writable if not otherwise specified, to avoid
breaking old stuff: almost nothing relies on symbol interposition, so
it's reasonable to turn it off by default.

(Actually Oracle's precompiler runtime libraries rely on it, but if I
told you what they try to do in regard of `stub libraries' exporting the
same symbols as glibc, you'd be violently ill, so I won't. In any case,
their foul hack doesn't work and is routinely disabled at sites using
said products, so going to any effort to avoid breaking it is futile.)

> on Solaris the (somewhat different) -Bdirect support is there as
> standard & widely used.

Yep, and it hasn't caused the world to end nor people who need
interposition to rise in armed rebellion. A good sign.

>> (Obviously, using LD_PRELOAD would also enable interposition.
>
> Sure - as soon as you LD_PRELOAD we turn all -Bdirect stuff off - of
> course, it'd be trivial to optimize that, but would require some
> re-factoring, that'd look unpleasant in patch form.

I doubt it's *worth* optimizing. Sure, let's avoid slowing it down, but
going to great lengths to speed it up will just encourage people to use
/etc/ld.so.preload or something, and then Ulrich will have an infarction
and it'll all be your fault. :)

I suppose there have been non-debugging uses of it, like that hack which
enabled Xft usage on GNOME 1, but, again, are these significant enough to
justify complicating already-complex, speed-and-correctness-critical code
in order to speed them up?

(I've also used it as a crude way to script some non-scriptable programs;
if I wanted something done before the Nth call of some preload-hookable
function... but I think I've done that twice in ten years. It's just
not that significant.)

>> LD_PRELOAD is definitely the instance of interposition that would have
>> people screaming if we took it away, given how many nifty debugging
>> hacks and diagnostic tricks it enables.)
>
> Of course /me uses it himself all the time.

Oh good. I don't have to worry about you breaking the actually *useful*
uses for symbol interposition then :)

--
`I won't make a secret of the fact that your statement/question
 sent a wave of shock and horror through us.' --- David Anderson