Speeding up the dynamic linker with 100s of DSOs?

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Speeding up the dynamic linker with 100s of DSOs?

John Reiser
> We have some binaries that get linked with an absurd number of dynamic
> shared objects (up to 700 or so), and it can take a very long time for
> these programs to execute. It can take 30 seconds just to get into
> main(), because it takes so long to resolve the global constructors.
> Each symbol resolution goes through the list of DSOs, testing each
> hash table, until it resolves the symbol.

>>Why is prelink not sufficient for you?
> These are unittests, so they only get executed once or twice per
> linkage. Prelinking is slower than just letting the dynamic linker
> take care of things.

> As I said before, Michael had good luck just making the
> hash tables bigger. ... At this point, it takes 7 seconds to reach main()
> instead of 30, which still isn't so great, but it's a start.

The DT_HASH table uses bucket heads and an overflow area with closed
chaining, so it is N:1 speedup over linear search.  For a given chain,
the curent static binder ld makes no effort to assign adjacent addresses
for each of the slot list, the Elf32_Sym structs, and the name strings.
There is an opportunity to reduce page faults and cache misses
when the average chain length is more than 1.

If the 700 dynamic shared objects can be processed before execution
of the unit tests, then there are other opportunities.  One is shown in
http://bitwagon.com/elfvector.html  where each DSO is assigned to a
fixed address that the other DSOs do not discover until runtime.  The
runtime linkage uses transfer vectors with fixed slots, and the speedup
is approximately the ratio N_symbols:N_DSOs.  Construction and maintenance
of the address-space assignments can be semi-automatic.  Coordination with
prelink and kernel vDSO policy is required: avoid each other's regions.
Even more attention is required if the same symbol name has multiple

Another possibility is the feature of pre-selecting (and recording during
static bind) which DSO contains the definition to be used at runtime.
(This may be a hint or a requirement.)  Solaris has such a feature.
For references in the same DSO as the definition, then the static bind
option -Bsymbolic may be of use.  The static binder ought to have an
"eager" option for name resolution, too, instead of delaying every .globl
name resolution until runtime.  The differs from current policy,
but so does the performance.

If the only use of the 700 dynamic shared objects is for the unit tests,
then the 700 DSOs can be built differently.  Build each as an ET_EXEC file
with fixed address, supplying the known SHN_ABS value of symbols from all
previous shared objects.  Study the section "Linker scripts" from "info ld".
This approach does not even require -fPIC compilation.  Of course such a
scheme is very brittle and has high cost in the face of change; but a
once-only usage scenario might benefit.