Getting offset of initial-exec TLS variables on GNU/Linux

Florian Weimer
Is it possible to obtain the offset of an initial-exec TLS variable on
GNU/Linux?

It doesn't seem so, because GDB evaluates the DWARF location expression
to access the TLS variable, so the offset is an implementation detail,
although it is often visible at the ELF layer.

Thanks,
Florian

Re: Getting offset of initial-exec TLS variables on GNU/Linux

Simon Marchi
On 2019-05-17 6:21 a.m., Florian Weimer wrote:
> Is it possible to obtain the offset of initial-exec TLS variable on
> GNU/Linux?
>
> It doesn't seem so because GDB executes the DWARF to access the TLS
> variable, so the offset is an implementation detail.  Although it is
> often visible at the ELF layer.
>
> Thanks,
> Florian

Can you clarify a little bit?

You are looking for the offset of the variable from which point of reference:

- The start of the TLS area of this module?
- The start of the TLS area of the current thread?
- Something else?

Also, are you looking for something you can find statically, with just the
executable, or are you working in the context of the live process?

Simon

Re: Getting offset of initial-exec TLS variables on GNU/Linux

Florian Weimer
* Simon Marchi:

> Can you clarify a little bit?
>
> You are looking for the offset of the variable from which point of reference:
>
> - The start of the TLS area of this module?
> - The start of the TLS area of the current thread?

The offset from something related to the thread pointer to the variable,
for cases where this is constant (specifically, initial-exec TLS
variables).

> Also, are you looking for something you can find statically, with just the
> executable, or are you working in the context of the live process?

I have a live process, and it would be best if the information matched
that process (even if it uses different libraries than those currently
installed in the file system).

Thanks,
Florian

Re: Getting offset of initial-exec TLS variables on GNU/Linux

Simon Marchi
On 2019-05-17 10:34 a.m., Florian Weimer wrote:

> * Simon Marchi:
>
>> Can you clarify a little bit?
>>
>> You are looking for the offset of the variable from which point of reference:
>>
>> - The start of the TLS area of this module?
>> - The start of the TLS area of the current thread?
>
> The offset from something related to the thread pointer to the variable,
> for cases where this is constant (specifically, initial-exec TLS
> variables).
>
>> Also, are you looking for something you can find statically, with just the
>> executable, or are you working in the context of the live process?
>
> I have a live process, and it would be best if the information matched
> that process (even if it uses different libraries than those currently
> installed in the file system).

Hi Florian,

I am still a bit unsure of what you are looking for concretely.  Are you looking for
a GDB command to print this offset?  Do you need to compute the offset in an external
tool?  What information are you starting with?  I think that a bit more context would
help us help you.

GDB uses libthread_db to get the location of TLS variables, which leaves all
implementation details about this to glibc.  And since you are a glibc maintainer, I
am not too sure what I can teach you about this, as you probably know more about this
than I do.

Simon

Re: Getting offset of initial-exec TLS variables on GNU/Linux

Florian Weimer
* Simon Marchi:

> Hi Florian,
>
> I am still a bit unsure of what you are looking for concretely.  Are
> you looking for a GDB command to print this offset?  Do you need to
> compute the offset in an external tool?  What information are you
> starting with? I think that a bit more context would help us help you.

I had hoped to get this offset so that I can access the TLS variable
without loading libthread_db.

I'd be fine with getting the module TLS offset from internal data
structures.

> GDB uses libthread_db to get the location of TLS variables, which
> leaves all implementation details about this to glibc.  And since you
> are a glibc maintainer, I am not too sure what I can teach you about
> this, as you probably know more about this than I do.

If I have libthread_db loaded and can somehow get the thread pointer
(e.g., $fs_base on x86-64), then I can get the constant I want just by
computing the difference between the two, like this:

(gdb) print (void *)&tcache  - (void *)$fs_base
$9 = -80

Going back a bit, I'm not sure what the API contract is for
DW_OP_GNU_push_tls_address.  It's not really clear to me if under the
ELF TLS ABI, there is an expectation that the dynamic linker always
allocates the TLS space for a DSO as a single block.  I've perused the
two documents for the GNU ELF TLS ABI, and this is never spelled out
explicitly.

Outside debugging information, a TLS relocation for non-initial-exec TLS
always consists of a pair of a module ID and an offset.  Therefore, it
should be possible to lazily allocate individual TLS variables within a
DSO, by assigning them separate module IDs.

But what seems to happen in practice is that there is just one TLS block
per DSO, which is allocated all at once when the first TLS variable is
accessed.  Furthermore, the entire TLS block is allocated non-lazily if
there is a single initial-exec TLS variable in a DSO.  Based on that, I
conclude that the module IDs are used only to share TLS variables for
the same symbol across multiple modules.  Due to this restriction, the
module ID for a TLS variable can be inferred from the object that
contains the DW_OP_GNU_push_tls_address opcode (and hopefully locating
the object based on symbol name in the debugger matches the way the dynamic
linker searches for symbols).

I wonder if all this implicitly but firmly encodes a correspondence
between the argument to DW_OP_GNU_push_tls_address and the offset of a
TLS variable, within the TLS block for a DSO—if the DSO has the
DF_STATIC_TLS flag set.  Assuming this is indeed true, we could add the
TLS offset of a DSO to the public part of the link map in the glibc
dynamic loader to help debuggers.  For TLS variables defined in
DF_STATIC_TLS DSOs, it should then be possible to access the TLS
variable without the help of libthread_db, assuming that we teach GDB
how to compute the TLS variable address from the thread pointer, DSO TLS
offset, and variable offset (something binutils seems to know for
several targets already, to implement relaxations).

Would this be a reasonable thing to do?

(Cc:ing Carlos, who probably knows what is really going on here.)

Thanks,
Florian

Re: Getting offset of initial-exec TLS variables on GNU/Linux

Carlos O'Donell
On 5/24/19 11:23 AM, Florian Weimer wrote:

> I had hoped to get this offset so that I can access the TLS variable
> without loading libthread_db.
>
> I'd be fine with getting the module TLS offset from internal data
> structures.
>
>> GDB uses libthread_db to get the location of TLS variables, which
>> leaves all implementation details about this to glibc.  And since you
>> are a glibc maintainer, I am not too sure what I can teach you about
>> this, as you probably know more about this than I do.
>
> If I have libthread_db loaded and can somehow get the thread pointer
> (e.g., $fs_base on x86-64), then I can get the constant I want just by
> computing the difference between the two, like this:
>
> (gdb) print (void *)&tcache  - (void *)$fs_base
> $9 = -80
>
> Going back a bit, I'm not sure what the API contract is for
> DW_OP_GNU_push_tls_address.  It's not really clear to me if under the
> ELF TLS ABI, there is an expectation that the dynamic linker always
> allocates the TLS space for a DSO as a single block.  I've perused the
> two documents for the GNU ELF TLS ABI, and this is never spelled out
> explicitly.

You are correct, it is not spelled out that the space for the loaded
module needs to be in a single contiguous region of memory, but it is
implied by the design of PT_TLS.

In practice we have no gaps because the initialization image for the block,
particularly for thread-local initialized global data, is copied as one
continuous block. If we wanted to support gaps we would need a more complex
definition than just the PT_TLS marker we use to identify that region (and
which sections go into that region during static link).
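
As a rough illustration of the point about PT_TLS, the following Python sketch
locates the PT_TLS program header of a little-endian ELF64 image and reports
the extent of the initialization image.  The names find_pt_tls and TlsSegment
are made up for this example; it ignores 32-bit and big-endian files and the
PN_XNUM extension, so treat it as a sketch rather than a complete ELF reader.

```python
import struct
from typing import NamedTuple, Optional

PT_TLS = 7  # program header type for the TLS initialization image

# Elf64_Ehdr and Elf64_Phdr layouts, little-endian only.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
PHDR = struct.Struct("<IIQQQQQQ")


class TlsSegment(NamedTuple):
    offset: int  # p_offset: file offset of the initialization image
    vaddr: int   # p_vaddr
    filesz: int  # p_filesz: initialized part (.tdata)
    memsz: int   # p_memsz: total block size, including zero-filled .tbss
    align: int   # p_align


def find_pt_tls(image: bytes) -> Optional[TlsSegment]:
    """Return the PT_TLS segment of an ELF64 image, or None if it has none."""
    ident = image[:16]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    fields = EHDR.unpack_from(image, 0)
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    for i in range(e_phnum):
        (p_type, _flags, p_offset, p_vaddr, _paddr,
         p_filesz, p_memsz, p_align) = PHDR.unpack_from(
            image, e_phoff + i * e_phentsize)
        if p_type == PT_TLS:
            return TlsSegment(p_offset, p_vaddr, p_filesz, p_memsz, p_align)
    return None
```

The header describes a single range: p_filesz bytes are copied for each
thread and the remaining p_memsz - p_filesz bytes are zero-filled, which is
why the per-module block ends up as one contiguous region.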

Does this answer your question about why it needs to be a single
continuous block?

Which two documents did you review?
 
> Outside debugging information, a TLS relocation for non-initial-exec TLS
> always consists of a pair of a module ID and an offset.  Therefore, it
> should be possible to lazily allocate individual TLS variables within a
> DSO, by assigning them separate module IDs.

You should be able to do that, but for each variable you must have access
to the size via symtab st_size to be able to copy the potentially initialized
value from PT_TLS, otherwise the block must be the full size and you must
initialize it as if it were all the data for the DSO.

> But what seems to happen in practice is that there is just one TLS block
> per DSO, which is allocated all at once when the first TLS variable is
> accessed.  Furthermore, the entire TLS block is allocated non-lazily if
> there is a single initial-exec TLS variable in a DSO.  

Just to be clear there is one TLS block per DSO per thread, which is allocated
once.

Yes, the entire TLS block is allocated non-lazily, and its initializing image
is copied from PT_TLS for that module.

Once you touch any TLS variable for the DSO the whole DSO is initialized for
use.

Yes, any IE TLS vars force the allocation of the whole block because there could
be IE TLS var uses immediately after startup relocation processing of the TPOFF
GOT relocations.

> Based on that, I
> conclude that the module IDs are used only to share TLS variables for
> the same symbol across multiple modules.

I do not think this conclusion is accurate.

What you are observing is a consequence of the fact that the static linker,
during construction of the binary, optimizes all DSOs seen at static link time
and places them into the initial TLS block.  Many of the references are also
optimized into TLS IE accesses, which have no module ID at all (they don't
need it).

The module ID is intended to reference the module or DSO, to allow the memory
for that module to be loaded lazily.

> Due to this restriction, the
> module ID for a TLS variable can be inferred from the object that
> contains the DW_OP_GNU_push_tls_address opcode (and hopefully locating
> the object based on symbol name in the debugger matches the way the dynamic
> linker searches for symbols).

I don't understand what you mean by "restriction" here?

The module ID for a TLS variable is assigned at runtime, and any inspection
process with a live process could find the module ID for the module by
looking for DTPMOD relocations, and reading their values out of the GOT
for the DSO.

In practice it's one mod id per DSO (as you note above it need not be,
but it is restricted by the PT_TLS design). So yes, if you see a DSO and it has
a DW_OP_GNU_push_tls_address, you can determine its module ID by inspecting
the DSO's GOT given dynamic relocation information, and once you have the
module ID you can call __tls_get_addr with the symbol offset to get the final
address (or find the dtv and traverse it for the thread).
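
The (module ID, offset) lookup described above can be modelled with a toy
DTV.  This is only a sketch of the idea, not glibc's implementation: Module,
Thread and tls_get_addr are hypothetical names, an "address" is represented
as a (block, offset) pair, and the generation counters and static TLS
handling of the real DTV are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Module:
    init_image: bytes  # initialized TLS data, as described by PT_TLS
    memsz: int         # total size of the module's TLS block


@dataclass
class Thread:
    # Toy DTV: module ID -> this thread's TLS block for that module.
    dtv: Dict[int, bytearray] = field(default_factory=dict)


def tls_get_addr(thread: Thread, modules: Dict[int, Module],
                 module_id: int, offset: int) -> Tuple[bytearray, int]:
    """Resolve (module ID, offset) for one thread, allocating lazily."""
    block = thread.dtv.get(module_id)
    if block is None:
        mod = modules[module_id]
        if len(mod.init_image) > mod.memsz:
            raise ValueError("initialization image larger than TLS block")
        # The whole block is allocated and initialized in one step:
        # copy the initialization image, leave the rest zero-filled.
        block = bytearray(mod.memsz)
        block[:len(mod.init_image)] = mod.init_image
        thread.dtv[module_id] = block
    if not 0 <= offset < len(block):
        raise IndexError("offset outside the module's TLS block")
    return block, offset
```

Touching any variable of a module initializes the whole block for that
thread, and each thread gets its own copy, which matches the "one TLS block
per DSO per thread" description.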

I don't actually understand how DW_OP_GNU_push_tls_address works, but the
comments for it seem to indicate it's a hack that glibc is supposed to fix,
but I've never been asked about it :-)

It *looks* like DW_OP_GNU_push_tls_address is just the offset from the start
of the block, but that's still not enough to compute the final address of
the variable in memory.

> I wonder if all this implicitly but firmly encodes a correspondence
> between the argument to DW_OP_GNU_push_tls_address and the offset of a
> TLS variable, within the TLS block for a DSO—if the DSO has the
> DF_STATIC_TLS flag set.  

Yes, I think this is true for DF_STATIC_TLS, because all such DSOs share
the same initial TLS block in the executable, so they are all offsets
from the same block.

I do see DWARF5 has DW_OP_form_tls_address, which is effectively the way
to translate a runtime specific argument + a TLS variable offset into a
final address.

Is DW_OP_form_tls_address the DWARF5 generic equivalent of
DW_OP_GNU_push_tls_address?

> Assuming this is indeed true, we could add the
> TLS offset of a DSO to the public part of the link map in the glibc
> dynamic loader to help debuggers.  

I wouldn't call it the TLS offset of a DSO; instead I would call it
what it is, the "static TLS offset", and if DF_STATIC_TLS is set for that
map, then you know you can find all the TLS variables without libthread_db.
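
A debugger-side check for DF_STATIC_TLS could look roughly like the sketch
below.  It assumes the raw contents of a little-endian ELF64 .dynamic
section are already available as bytes; has_static_tls is a made-up helper
name.  DT_FLAGS and DF_STATIC_TLS are the values from the ELF specification.

```python
import struct

DT_NULL = 0            # terminates the dynamic array
DT_FLAGS = 30          # entry holding the DF_* flag bits
DF_STATIC_TLS = 0x10   # object uses the static TLS model

# Elf64_Dyn: signed d_tag followed by d_val/d_ptr, little-endian.
DYN = struct.Struct("<qQ")


def has_static_tls(dynamic: bytes) -> bool:
    """Report whether the dynamic section sets DF_STATIC_TLS in DT_FLAGS.

    The input length must be a multiple of the entry size (16 bytes),
    otherwise struct.error is raised.
    """
    for d_tag, d_val in DYN.iter_unpack(dynamic):
        if d_tag == DT_NULL:
            break
        if d_tag == DT_FLAGS:
            return bool(d_val & DF_STATIC_TLS)
    return False
```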

> For TLS variables defined in
> DF_STATIC_TLS DSOs, it should then be possible to access the TLS
> variable without the help of libthread_db, assuming that we teach GDB
> how to compute the TLS variable address from the thread pointer, DSO TLS
> offset, and variable offset (something binutils seems to know for
> several targets already, to implement relaxations).

Right, you would need:

TP + DSO offset + VAR offset = final symbol address.

The VAR offset is actually the DW_OP_const<n><x> operations prior to the
DW_OP_form_tls_address call.

TP + DSO offset is 'td_thr_tlsbase' from nptl_db, so you would be exposing
only the DSO offset value in the link map *only* for the DF_STATIC_TLS case,
because it's a fixed value that won't change.

Note that 'td_thr_tls_get_addr' from nptl_db does the above calculation.

We'd be teaching gdb how to manually run 'td_thr_tlsbase' for the limited
case of DF_STATIC_TLS. It would solve some of the problems we have, and
remove any hacks for errno, but would not fully solve dynamic TLS variable
access in a single threaded program (though it's a big step forward).
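
The TP + DSO offset + VAR offset calculation above can be sketched as
follows.  All numbers are hypothetical and only chosen so that the result
reproduces the thread-pointer-relative offset of -80 from the GDB session
quoted earlier; static_tls_address is a made-up name.  On x86-64 the static
TLS block lies below the thread pointer, so the DSO offset is negative there.

```python
ADDRESS_MASK = 0xFFFFFFFFFFFFFFFF  # wrap results to 64-bit addresses


def static_tls_address(thread_pointer: int, dso_offset: int,
                       var_offset: int) -> int:
    """TP + DSO offset + VAR offset for a variable in static TLS."""
    return (thread_pointer + dso_offset + var_offset) & ADDRESS_MASK


# Hypothetical values for illustration only.
tp_thread_a = 0x7FFFF7D8A740   # e.g. $fs_base of one thread
tp_thread_b = 0x7FFFF75FF700   # thread pointer of another thread
dso_offset = -0x80             # static TLS offset of the DSO (fixed)
var_offset = 0x30              # operand pushed before the TLS DWARF op

addr_a = static_tls_address(tp_thread_a, dso_offset, var_offset)
addr_b = static_tls_address(tp_thread_b, dso_offset, var_offset)

# The distance from the thread pointer is the same in every thread.
print(addr_a - tp_thread_a, addr_b - tp_thread_b)
```

Only the thread pointer differs between threads; the other two terms are
constants for a DF_STATIC_TLS object, which is what makes the lookup
possible without libthread_db.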

> Would this be a reasonable thing to do?
>
> (Cc:ing Carlos, who probably knows what is really going on here.)

I think your conclusion is sound, even if not all the intermediate steps
made sense to me.

--
Cheers,
Carlos.

Re: Getting offset of initial-exec TLS variables on GNU/Linux

Florian Weimer
* Carlos O'Donell:

>> Going back a bit, I'm not sure what the API contract is for
>> DW_OP_GNU_push_tls_address.  It's not really clear to me if under the
>> ELF TLS ABI, there is an expectation that the dynamic linker always
>> allocates the TLS space for a DSO as a single block.  I've perused the
>> two documents for the GNU ELF TLS ABI, and this is never spelled out
>> explicitly.
>
> You are correct, it is not spelled out that the space for the loaded
> module needs to be in a single contiguous region of memory, but it is
> implied by the design of PT_TLS.
>
> In practice we have no gaps because the initialization image for the block,
> particularly for thread-local initialized global data, is copied as one
> continuous block. If we wanted to support gaps we would need a more complex
> definition than just the PT_TLS marker we use to identify that region (and
> which sections go into that region during static link).
>
> Does this answer your question about why it needs to be a single
> continuous block?

To some extent.  Global TLS symbols with a dynamic symbol entry do have
size information attached, so we could just allocate that much and copy
the initialization from the PT_TLS segment.  It's what I would have
expected to happen.

> Which two documents did you review?

<https://akkadia.org/drepper/tls.pdf>
<https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt>

>  
>> Outside debugging information, a TLS relocation for non-initial-exec TLS
>> always consists of a pair of a module ID and an offset.  Therefore, it
>> should be possible to lazily allocate individual TLS variables within a
>> DSO, by assigning them separate module IDs.
>
> You should be able to do that, but for each variable you must have
> access to the size via symtab st_size to be able to copy the
> potentially initialized value from PT_TLS, otherwise the block must be
> the full size and you must initialize it as if it were all the data
> for the DSO.

Right.  For local symbols, we do not necessarily have this size
information.

>> But what seems to happen in practice is that there is just one TLS block
>> per DSO, which is allocated all at once when the first TLS variable is
>> accessed.  Furthermore, the entire TLS block is allocated non-lazily if
>> there is a single initial-exec TLS variable in a DSO.  
>
> Just to be clear there is one TLS block per DSO per thread, which is
> allocated once.
>
> Yes, the entire TLS block is allocated non-lazily, and its
> initializing image is copied from PT_TLS for that module.
>
> Once you touch any TLS variable for the DSO the whole DSO is
> initialized for use.
>
> Yes, any IE TLS vars force the allocation of the whole block because
> there could be IE TLS var uses immediately after startup relocation
> processing of the TPOFF GOT relocations.

The question is if this is an absolute requirement.

Alex's document alludes to a different possibility:

“
The use of TLS descriptors to access thread-local variables would
enable the compression of the DTV such that it contained only entries
for non-static modules.  Static ones could be given negative ids, such
that legacy relocations and direct calls to __tls_get_addr() could
still work correctly, but entries could be omitted from the DTV, and
the DTV entries would no longer need the boolean currently used to
denote entries that are in Static TLS.
”

Maybe that could enable separate allocation of initial-exec and other
TLS models, too.  Then DF_STATIC_TLS as an indicator for the
libthread_db bypass would no longer work.

>> Based on that, I conclude that the module IDs are used only to share
>> TLS variables for the same symbol across multiple modules.
>
> I do not think this conclusion is accurate.
>
> What you are observing is a consequence of the fact that the static
> linker, during construction of the binary, optimizes all DSOs seen at
> static link time and places them into the initial TLS block.  Many of
> the references are also optimized into TLS IE accesses, which have no
> module ID at all (they don't need it).
>
> The module ID is intended to reference the module or DSO, to allow the
> memory for that module to be loaded lazily.

Maybe I phrased my conclusion poorly.  From an algorithmic point of
view, outside the debugging information, I do not think there is an
intrinsic reason why a single DSO could have just one TLS module ID.

>> Due to this restriction, the module ID for a TLS variable can be
>> inferred from the object that contains the DW_OP_GNU_push_tls_address
>> opcode (and hopefully locating the object based on symbol name in the
>> debugger matches the way the dynamic linker searches for symbols).
>
> I don't understand what you mean by "restriction" here?
>
> The module ID for a TLS variable is assigned at runtime, and any
> inspection process with a live process could find the module ID for
> the module by looking for DTPMOD relocations, and reading their values
> out of the GOT for the DSO.

I don't think you can look at the relocation.  The thread-local variable
can be in scope in a compilation unit, but there might not be any
reference to it, so there is no relocation that would reveal its
location.

> In practice it's one mod id per DSO (but as you note above it need not
> be, but is restricted by PT_TLS design). So yes, if you see a DSO and
> it has a DW_OP_GNU_push_tls_address, you can determine its module ID
> by inspecting the DSO's GOT given dynamic relocation information, and
> once you have the module ID you can call __tls_get_addr with the
> symbol offset to get the final address (or find the dtv and traverse
> it for the thread).
>
> I don't actually understand how DW_OP_GNU_push_tls_address works, but
> the comments for it seem to indicate it's a hack that glibc is
> supposed to fix, but I've never been asked about it :-)
>
> It *looks* like DW_OP_GNU_push_tls_address is just the offset from the
> start of the block, but that's still not enough to compute the final
> address of the variable in memory.

My concern is that the interface, in theory, would allow very different
address translations through libthread_db, similar to how we use the
DSO-internal offset as a hash table key in _dl_make_tlsdesc_dynamic.
I'm not sure the staged computation (first the TLS base address, then
the combination with the offset) completely prevents that.

Teaching GDB how this works today would thus constrain future evolution
of the internal library design.

But if you say that we can perform lazy allocation only en bloc, once
per DSO, then that doesn't matter.

>> Assuming this is indeed true, we could add the
>> TLS offset of a DSO to the public part of the link map in the glibc
>> dynamic loader to help debuggers.  
>
> I wouldn't call it the TLS offset of a DSO, instead I would call it
> what it is "static TLS offset", and if DF_STATIC_TLS is set for that
> map, then you know you can find all the TLS variables without libthread_db.
>
>> For TLS variables defined in
>> DF_STATIC_TLS DSOs, it should then be possible to access the TLS
>> variable without the help of libthread_db, assuming that we teach GDB
>> how to compute the TLS variable address from the thread pointer, DSO TLS
>> offset, and variable offset (something binutils seems to know for
>> several targets already, to implement relaxations).
>
> Right, you would need:
>
> TP + DSO offset + VAR offset = final symbol address.
>
> The VAR offset is actually the DW_OP_const<n><x> operations prior to the
> DW_OP_form_tls_address call.
>
> TP + DSO offset is 'td_thr_tlsbase' from nptl_db, so you would be exposing
> only the DSO offset value in the link map *only* for the DF_STATIC_TLS case,
> because it's a fixed value that won't change.
>
> Note that 'td_thr_tls_get_addr' from nptl_db does the above calculation.
>
> We'd be teaching gdb how to manually run 'td_thr_tlsbase' for the limited
> case of DF_STATIC_TLS. It would solve some of the problems we have, and
> remove any hacks for errno, but would not fully solve dynamic TLS variable
> access in a single threaded program (though it's a big step forward).

If we fix the startup problem with dlopen and initial-exec TLS in glibc
(changes that do not affect ABI at all), more people can use
initial-exec TLS and benefit from the GDB enhancement.  And yes, while
the general problem of TCB placement for future initial-exec TLS
allocations is unsolvable, we can be much smarter about what we do than
today, if we separate TCB allocation from stack allocation.

Thanks,
Florian

Re: Getting offset of initial-exec TLS variables on GNU/Linux

Carlos O'Donell
On 5/24/19 3:12 PM, Florian Weimer wrote:

> * Carlos O'Donell:
>
>>> Going back a bit, I'm not sure what the API contract is for
>>> DW_OP_GNU_push_tls_address.  It's not really clear to me if under the
>>> ELF TLS ABI, there is an expectation that the dynamic linker always
>>> allocates the TLS space for a DSO as a single block.  I've perused the
>>> two documents for the GNU ELF TLS ABI, and this is never spelled out
>>> explicitly.
>>
>> You are correct, it is not spelled out that the space for the loaded
>> module needs to be in a single contiguous region of memory, but it is
>> implied by the design of PT_TLS.
>>
>> In practice we have no gaps because the initialization image for the block,
>> particularly for thread-local initialized global data, is copied as one
>> continuous block. If we wanted to support gaps we would need a more complex
>> definition than just the PT_TLS marker we use to identify that region (and
>> which sections go into that region during static link).
>>
>> Does this answer your question about why it needs to be a single
>> continuous block?
>
> To some extent.  Global TLS symbols with a dynamic symbol entry do have
> size information attached, so we could just allocate that much and copy
> the initialization from the PT_TLS segment.  It's what I would have
> expected to happen.

I don't know if such a micro-optimization is justified without a lot more
analysis of existing binaries. I would expect that if any TLS variable is
touched it's simply a constant cost to allocate the entire block, rather
than doing page-at-a-time allocations. You'd have to reserve the whole
address space for the block too, and then track the initialized spans,
again it seems like a lot of work without the data to support it (yet).
I'm not saying it's a bad idea, it's just not justified without more data.

The behaviour you suggest is not the behaviour I would have expected,
but then again it's hard for me to be objective because I've worked with
it for so long. I worked on the hppa port with Dave Anglin and Randolph
Chung, but at the time I don't remember thinking "this is odd, why not
do it on a per-symbol basis like dynamic symbol resolution."

I expect that there are difficult corner cases, and local tls variables
have no size as you point out, so we'd need some other structure to
capture their size.

In summary:
- Optimizing the allocation of the TLS blocks to minimize RSS usage is
  an optimization that doesn't have data to support it.

>> Which two documents did you review?
>
> <https://akkadia.org/drepper/tls.pdf>
> <https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt>

OK.

>>  
>>> Outside debugging information, a TLS relocation for non-initial-exec TLS
>>> always consists of a pair of a module ID and an offset.  Therefore, it
>>> should be possible to lazily allocate individual TLS variables within a
>>> DSO, by assigning them separate module IDs.
>>
>> You should be able to do that, but for each variable you must have
>> access to the size via symtab st_size to be able to copy the
>> potentially initialized value from PT_TLS, otherwise the block must be
>> the full size and you must initialize it as if it were all the data
>> for the DSO.
>
> Right.  For local symbols, we do not necessarily have this size
> information.

Right, and so this limits the solution, maybe sufficiently that it's not
a useful optimization at that point.

>>> But what seems to happen in practice is that there is just one TLS block
>>> per DSO, which is allocated all at once when the first TLS variable is
>>> accessed.  Furthermore, the entire TLS block is allocated non-lazily if
>>> there is a single initial-exec TLS variable in a DSO.  
>>
>> Just to be clear there is one TLS block per DSO per thread, which is
>> allocated once.
>>
>> Yes, the entire TLS block is allocated non-lazily, and its
>> initializing image is copied from PT_TLS for that module.
>>
>> Once you touch any TLS variable for the DSO the whole DSO is
>> initialized for use.
>>
>> Yes, any IE TLS vars force the allocation of the whole block because
>> there could be IE TLS var uses immediately after startup relocation
>> processing of the TPOFF GOT relocations.
>
> The question is if this is an absolute requirement.

The IE TLS variable accesses are carried out by generated code sequences
that directly access the GOT without any interception mechanism.

Therefore, once relocation processing is expected to be complete, IE TLS
variables can be used immediately.

Therefore if your DSO's TLS block has IE TLS variables, then the whole
block needs to be statically allocated, and needs to be allocated immediately
and up front.

I think this is an absolute requirement for IE TLS; whether this is a requirement
for non-IE TLS is an open question.

> Alex's document alludes to a different possibility:
>
> “
> The use of TLS descriptors to access thread-local variables would
> enable the compression of the DTV such that it contained only entries
> for non-static modules.  Static ones could be given negative ids, such
> that legacy relocations and direct calls to __tls_get_addr() could
> still work correctly, but entries could be omitted from the DTV, and
> the DTV entries would no longer need the boolean currently used to
> denote entries that are in Static TLS.
> ”

It's not clear to me what kind of real tangible benefit this optimization
would have. It seems like speculation about reducing DTV size, when that's
not an issue that has ever been reported or observed.

Yes, there is one DTV entry per module, but it's also a tiny entry. I don't
follow what the motivation could be.
 
> Maybe that could enable separate allocation of initial-exec and other
> TLS models, too.  Then DF_STATIC_TLS as an indicator for the
> libthread_db bypass would no longer work.

Maybe, but the concept of there being a single TLS base address for the
DSO is already baked into the nptl_db interface, so the concept of a single
block per module id is fixed.

However, doesn't TLSDESC violate this concept? What happens if you run out
of opportunistic static TLS block space while resolving descriptors? I think
you fall back to allocating more space and that's not in the static TLS block,
so that would be an example of two blocks. I would have to test this.

>>> Based on that, I conclude that the module IDs are used only to share
>>> TLS variables for the same symbol across multiple modules.
>>
>> I do not think this conclusion is accurate.
>>
>> What you are observing is a consequence of the fact that the static
>> linker, during construction of the binary, optimizes all DSOs seen at
>> static link time and places them into the initial TLS block. Many of
>> the references are also optimized into TLS IE, which have no module
>> ID at all (they don't need it).
>>
>> The module ID is intended to reference the module or DSO, to allow the
>> memory for that module to be loaded lazily.
>
> Maybe I phrased my conclusion poorly.  From an algorithmic point of
> view, outside the debugging information, I do not think there is an
> intrinsic reason why a single DSO could have just one TLS module ID.

Correct, and as I point out above, I'm not entirely sure what happens if
tlsdesc runs out of static TLS block space during resolution e.g. if
CHECK_STATIC returns 0 while still needing space for another descriptor.

>>> Due to this restriction, the module ID for a TLS variable can be
>>> inferred from the object that contains the DW_OP_GNU_push_tls_address
>>> opcode (and hopefully locating the object based on symbol name in the
>>> debugger matches the way the dynamic linker searches for symbols).
>>
>> I don't understand what you mean by "restriction" here?
>>
>> The module ID for a TLS variable is assigned at runtime, and any
>> inspection process with a live process could find the module ID for
>> the module by looking for DTPMOD relocations, and reading their values
>> out of the GOT for the DSO.
>
> I don't think you can look at the relocation.  The thread-local variable
> can be in scope in a compilation unit, but there might not be any
> reference to it, so there is no relocation that would reveal its
> location.

Yes, you're right. In LD we just load the value for the first symbol, but
there could be N local TLS variables we access with the same call to
__tls_get_addr, and we wouldn't find them.
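
A small sketch of that situation, assuming GCC or Clang (the tls_model
attribute is an extension) and with made-up variable names. What the
compiler and linker actually emit depends on the target, on -fPIC, and on
linker relaxation, so treat the comments as the expected shape rather than
a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two TLS variables local to this module, using the local-dynamic model. */
static __thread int counter_a __attribute__((tls_model("local-dynamic"))) = 1;
static __thread int counter_b __attribute__((tls_model("local-dynamic"))) = 2;

/* In PIC code the compiler can resolve the module's TLS base once and
   reach both variables at fixed offsets from it, so counter_b need not
   have a dynamic relocation of its own. */
static int sum_locals(void)
{
    return counter_a + counter_b;
}

/* Distance between the two variables inside this module's TLS block.
   The layout of the block is fixed at link time. */
static ptrdiff_t local_delta(void)
{
    ptrdiff_t d = (ptrdiff_t)((uintptr_t)&counter_b - (uintptr_t)&counter_a);
    assert(d != 0);  /* distinct objects never share an address */
    return d;
}
```

A tool that only scans relocations would see at most the module-base
lookup here, not one entry per variable.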

>> In practice it's one mod id per DSO (but as you note above it need not
>> be, but is restricted by PT_TLS design). So yes, if you see a DSO and
>> it has a DW_OP_GNU_push_tls_address, you can determine its module ID
>> by inspecting the DSO's GOT given dynamic relocation information, and
>> once you have the module ID you can call __tls_get_addr with the
>> symbol offset to get the final address (or find the dtv and traverse
>> it for the thread).
>>
>> I don't actually understand how DW_OP_GNU_push_tls_address works, but
>> the comments for it seem to indicate it's a hack that glibc is
>> supposed to fix, but I've never been asked about it :-)
>>
>> It *looks* like DW_OP_GNU_push_tls_address is just the offset from the
>> start of the block, but that's still not enough to compute the final
>> address of the variable in memory.
>
> My concern is that the interface, in theory, would allow very different
> address translations through libthread_db, similar to how we use the
> DSO-internal offset as a hash table key in _dl_make_tlsdesc_dynamic.
> I'm not sure the staged computation (first the TLS base address, then
> the combination with the offset) completely prevents that.
>
> Teaching GDB how this works today would thus constrain future evolution
> of the internal library design.
>
> But if you say that we can perform lazy allocation only en bloc, once
> per DSO, then that doesn't matter.

Well... it's a limit of the current implementation.

I appreciate that we want to avoid constraining the future implementation.

So why can't we load libthread_db today?

Is there really a requirement on having libpthread.so.0 loaded?

Can't we tackle that problem instead?

>>> Assuming this is indeed true, we could add the
>>> TLS offset of a DSO to the public part of the link map in the glibc
>>> dynamic loader to help debuggers.
>>
>> I wouldn't call it the TLS offset of a DSO, instead I would call it
>> what it is, the "static TLS offset", and if DF_STATIC_TLS is set for that
>> map, then you know you can find all the TLS variables without libthread_db.
>>
>>> For TLS variables defined in
>>> DF_STATIC_TLS DSOs, it should then be possible to access the TLS
>>> variable without the help of libthread_db, assuming that we teach GDB
>>> how to compute the TLS variable address from the thread pointer, DSO TLS
>>> offset, and variable offset (something binutils seems to know for
>>> several targets already, to implement relaxations).
>>
>> Right, you would need:
>>
>> TP + DSO offset + VAR offset = final symbol address.
>>
>> The VAR offset is actually the DW_OP_const<n><x> operations prior to the
>> DW_OP_form_tls_address call.
>>
>> TP + DSO offset is 'td_thr_tlsbase' from nptl_db, so you would be exposing
>> only the DSO offset value in the link map *only* for the DF_STATIC_TLS case,
>> because it's a fixed value that won't change.
>>
>> Note that 'td_thr_tls_get_addr' from nptl_db does the above calculation.
>>
>> We'd be teaching gdb how to manually run 'td_thr_tlsbase' for the limited
>> case of DF_STATIC_TLS. It would solve some of the problems we have, and
>> remove any hacks for errno, but would not fully solve dynamic TLS variable
>> access in a single threaded program (though it's a big step forward).
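
The calculation quoted above, written out as a minimal C sketch. The
function name is hypothetical and is not part of GDB, glibc, or nptl_db.
The sign convention is also an assumption: the DSO offset is taken as a
signed value relative to the thread pointer, so a block that lies below
the thread pointer has a negative offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: TP + DSO offset + VAR offset = final address.
   tp         - thread pointer of the thread being inspected
   dso_offset - signed static TLS offset of the DSO relative to tp
   var_offset - offset of the variable inside the DSO's TLS block */
static uintptr_t static_tls_var_address(uintptr_t tp, intptr_t dso_offset,
                                        uintptr_t var_offset)
{
    assert(tp != 0);
    /* Unsigned arithmetic wraps, so a negative dso_offset subtracts. */
    return tp + (uintptr_t)dso_offset + var_offset;
}
```

For example, with tp = 0x7000, a DSO offset of -0x100 and a variable
offset of 0x8, the result is 0x6f08. This only holds for DF_STATIC_TLS
modules, where the DSO offset is fixed for the lifetime of the process.
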
>
> If we fix the startup problem with dlopen and initial-exec TLS in glibc
> (changes that do not affect ABI at all), more people can use
> initial-exec TLS and benefit from the GDB enhancement.

What problem is this?

You can use IE today from dlopen; each new dlopen consumes some of the available
static TLS image space, and eventually it will fail when you run out of space.
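
As a toy model of that failure mode, think of the static TLS surplus as a
fixed-size pool that each dlopen'ed IE module carves a block out of. The
struct and function names below are invented for illustration and the pool
size is arbitrary; this is not how glibc tracks the space internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed pool standing in for the static TLS surplus. */
struct static_tls_pool {
    size_t size;  /* total bytes available */
    size_t used;  /* bytes already handed out */
};

/* Try to reserve `blocksize` bytes with the given power-of-two alignment.
   On success store the block's offset in *offset_out and return true;
   return false once the pool cannot fit the request. */
static bool reserve_static_tls(struct static_tls_pool *pool, size_t blocksize,
                               size_t align, size_t *offset_out)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (pool->used + align - 1) & ~(align - 1);
    if (start > pool->size || blocksize > pool->size - start)
        return false;
    *offset_out = start;
    pool->used = start + blocksize;
    return true;
}
```

Nothing is ever returned to the pool in this model, which is why repeated
loads eventually fail.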

> And yes, while
> the general problem of TCB placement for future initial-exec TLS
> allocations is unsolvable, we can be much smarter about what we do than
> today, if we separate TCB allocation from stack allocation.

Yes, splitting out TLS/TCB from thread allocation would be a really good project.

In summary, if I had a wishlist of things I'd like to work on:

- Revisit why we can't always load libthread_db.

- Split TLS/TCB out from stack.

- Allocate all TLS space at dlopen time to avoid runtime failure if malloc
  fails. Enable by default to avoid runtime failures, with a flag to get back
  the old behaviour.

- Revisit moving every arch to TLSDESC.

--
Cheers,
Carlos.