Speeding up the dynamic linker with 100s of DSOs?

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Speeding up the dynamic linker with 100s of DSOs?

Andrew Chatham
Hi,

We have some binaries that get linked with an absurd number of dynamic
shared objects (up to 700 or so), and it can take a very long time for
these programs to execute. It can take 30 seconds just to get into
main(), because it takes so long to resolve the global constructors.
Each symbol resolution goes through the list of DSOs, testing each
hash table, until it resolves the symbol.

We could improve this in our code by making more symbols static or by
collapsing the DSOs, but those are difficult to push through, so we've
been looking at the dynamic linker. It sounds like other people have
had similar problems, too.

One of my colleagues, Michael Chastain, tried increasing the number of
buckets in the ELF hash table in the DSOs, and that improved things
somewhat. Then I tried computing bloom filters for the DSOs as they
were loaded by the dynamic linker, which seemed to be speed things up
a bit more.

At this point, it takes 7 seconds to reach main() instead of 30, which
still isn't so great, but it's a start. At this point, my patch is
messy and leaks memory and may not work with dlopen, and we haven't
gotten our copyright agreement sorted out yet, so I can't share the
code right now. But I wanted to vet some ideas before continuing with
it.

Even with the bloom filters, resolving a symbol is still O(n) in the
number of DSOs; I just improved the constant factor. I had some ideas
for speeding up resolution against linked-in libraries (we'd just
handle dlopened DSOs the slow way), but I didn't figure out when the
dynamic linker is done loading those. If I had that, I could either
build some larger filters to allow for a binary search or just build
one big hash table over all linked-in DSOs and search against that.

Do any of these approaches sound reasonable? Separate bloom filters
per DSO (safe, almost done)? Hierarchical filters to allow binary
search? One big hash table?

Andrew

Reply | Threaded
Open this post in threaded view
|

Re: Speeding up the dynamic linker with 100s of DSOs?

Andrew Chatham
On 1/4/06, Andrew Chatham <[hidden email]> wrote:

> Hi,
>
> We have some binaries that get linked with an absurd number of dynamic
> shared objects (up to 700 or so), and it can take a very long time for
> these programs to execute. It can take 30 seconds just to get into
> main(), because it takes so long to resolve the global constructors.
> Each symbol resolution goes through the list of DSOs, testing each
> hash table, until it resolves the symbol.
>
> We could improve this in our code by making more symbols static or by
> collapsing the DSOs, but those are difficult to push through, so we've
> been looking at the dynamic linker. It sounds like other people have
> had similar problems, too.
>
> One of my colleagues, Michael Chastain, tried increasing the number of
> buckets in the ELF hash table in the DSOs, and that improved things
> somewhat. Then I tried computing bloom filters for the DSOs as they
> were loaded by the dynamic linker, which seemed to be speed things up
> a bit more.
>
> At this point, it takes 7 seconds to reach main() instead of 30, which
> still isn't so great, but it's a start. At this point, my patch is
> messy and leaks memory and may not work with dlopen, and we haven't
> gotten our copyright agreement sorted out yet, so I can't share the
> code right now. But I wanted to vet some ideas before continuing with
> it.
>
> Even with the bloom filters, resolving a symbol is still O(n) in the
> number of DSOs; I just improved the constant factor. I had some ideas
> for speeding up resolution against linked-in libraries (we'd just
> handle dlopened DSOs the slow way), but I didn't figure out when the
> dynamic linker is done loading those. If I had that, I could either
> build some larger filters to allow for a binary search or just build
> one big hash table over all linked-in DSOs and search against that.
>
> Do any of these approaches sound reasonable? Separate bloom filters
> per DSO (safe, almost done)? Hierarchical filters to allow binary
> search? One big hash table?
>
> Andrew
And here's a shell script that demonstrates the problem. It compiles
700 DSOs with some global objects and then links them together. Takes
10 seconds user time with the stock 2.3.5 dynamic linker, 5 seconds
with my modified 2.3.6 dynamic linker.

Andrew

dl_test.sh (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Speeding up the dynamic linker with 100s of DSOs?

Roland McGrath
In reply to this post by Andrew Chatham
Why is prelink not sufficient for you?

Reply | Threaded
Open this post in threaded view
|

Re: Speeding up the dynamic linker with 100s of DSOs?

H.J. Lu-27
On Fri, Jan 06, 2006 at 03:58:52AM -0800, Roland McGrath wrote:
> Why is prelink not sufficient for you?

Does prelink work with dlopened DSOs?


H.J.

Reply | Threaded
Open this post in threaded view
|

Re: Speeding up the dynamic linker with 100s of DSOs?

Andrew Chatham
In reply to this post by Roland McGrath
On 1/6/06, Roland McGrath <[hidden email]> wrote:
> Why is prelink not sufficient for you?

These are unittests, so they only get executed once or twice per
linkage. Prelinking is slower than just letting the dynamic linker
take care of things. This may be a unique use case, so maybe prelink
is good enough for everyone else. Even so, it'd be nice if you didn't
have to use prelink.

I'm going to try to fix this for us somehow and release a patch,
whether it goes into the official glibc or not. Perhaps my efforts
would be better directed towards speeding up prelink. I haven't
actually looked at the implementation yet.

Andrew

Reply | Threaded
Open this post in threaded view
|

Re: Speeding up the dynamic linker with 100s of DSOs?

Roland McGrath
In reply to this post by H.J. Lu-27
> On Fri, Jan 06, 2006 at 03:58:52AM -0800, Roland McGrath wrote:
> > Why is prelink not sufficient for you?
>
> Does prelink work with dlopened DSOs?

It could if you told it to prelink them, I think.  
But Andrew's posting talking about required libraries, not dlopen.

Reply | Threaded
Open this post in threaded view
|

Re: Speeding up the dynamic linker with 100s of DSOs?

Roland McGrath
In reply to this post by Andrew Chatham
> These are unittests, so they only get executed once or twice per
> linkage.

Then this is not a case that really matters to do something about system-wide.
We put optimization efforts towards the common cases, not the corners.

> Prelinking is slower than just letting the dynamic linker take care of
> things.

That will probably always be the case if you are only ever going to run the
binary once.

> I'm going to try to fix this for us somehow and release a patch,
> whether it goes into the official glibc or not. Perhaps my efforts
> would be better directed towards speeding up prelink. I haven't
> actually looked at the implementation yet.

What we'll encourage is whatever efforts help to optimize cases that affect
a lot of users.  Linking something and then running it only once, is common
for you and me, but not common among dynamic linker invocations overall.
If you find improvements to the dynamic linker that improve your case
without hurting others, nor (further) overly complicating the code nor
other downsides, we will of course be receptive.  There are certainly other
cases of programs that load way too many DSOs, and some of those are with
dlopen and thus much harder to help with prelink.  But algorithm
improvements or tunings need to be beneficial across the board, or at least
beneficial to worthwhile cases while not harming other common cases.


Thanks,
Roland

Reply | Threaded
Open this post in threaded view
|

Re: Speeding up the dynamic linker with 100s of DSOs?

Andrew Chatham
On 1/6/06, Roland McGrath <[hidden email]> wrote:

> > I'm going to try to fix this for us somehow and release a patch,
> > whether it goes into the official glibc or not. Perhaps my efforts
> > would be better directed towards speeding up prelink. I haven't
> > actually looked at the implementation yet.
>
> What we'll encourage is whatever efforts help to optimize cases that affect
> a lot of users.  Linking something and then running it only once, is common
> for you and me, but not common among dynamic linker invocations overall.
> If you find improvements to the dynamic linker that improve your case
> without hurting others, nor (further) overly complicating the code nor
> other downsides, we will of course be receptive.  There are certainly other
> cases of programs that load way too many DSOs, and some of those are with
> dlopen and thus much harder to help with prelink.  But algorithm
> improvements or tunings need to be beneficial across the board, or at least
> beneficial to worthwhile cases while not harming other common cases.

Fair enough, though it seems similar LD_BIND_NOW, perhaps a bit more
complicated. My current patch is a 50 lines touching the dynamic
linker and 150 to implement the bloom filters. I'll just continue on
my own and put the patch on code.google.com when I can.

A broader takeaway is that the code for searching the ELF hash tables
is pretty slow, if just putting a bloom filter in front of it sped
things up so much. This affects everyone using DSOs, though perhaps
not enough to notice. Not sure there's much room to fix it, though,
since the hash table is part of the ELF spec and looks pretty
reasonable. As I said before, Michael had good luck just making the
hash tables bigger.

Andrew

Reply | Threaded
Open this post in threaded view
|

Re: Speeding up the dynamic linker with 100s of DSOs?

Roland McGrath
> Fair enough, though it seems similar LD_BIND_NOW, perhaps a bit more
> complicated. My current patch is a 50 lines touching the dynamic
> linker and 150 to implement the bloom filters. I'll just continue on
> my own and put the patch on code.google.com when I can.

If you want to contribute to the community, please post your patches here
(and of course be prepared to assign copyright).

> A broader takeaway is that the code for searching the ELF hash tables is
> pretty slow, if just putting a bloom filter in front of it sped things up
> so much. This affects everyone using DSOs, though perhaps not enough to
> notice. Not sure there's much room to fix it, though, since the hash
> table is part of the ELF spec and looks pretty reasonable.

The format is the format and extensions for speedups would need a pretty
compelling case.  But contributions to optimize the code we have for the
format we have are of course welcome.

> As I said before, Michael had good luck just making the hash tables bigger.

This is under the control of the linker.  I'm sure that linker changes to
do a better job choosing good table sizes would be welcome (but we here are
not the linker maintainers).


Thanks,
Roland