Re: Scheduling x86 dispatch windows

Quentin Neill
Cross-posting Reza's call for feedback to the binutils list since it
is relevant; see the last few paragraphs regarding how to "solve the
alignment problem".

Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402

Not sure if followups should occur on one list or both.
--
Quentin Neill


On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani <[hidden email]> wrote:

> Hi,
>
> We are in the process of adding a feature to GCC to take advantage
> of a new hardware feature in the latest AMD microprocessor. This
> feature requires a certain mix, ordering, and alignment of
> instruction sequences to obtain the expected hardware performance.
>
> I am asking the community to review this high level implementation
> design and give me direction or advice.
>
> The new hardware issues two windows of N bytes of instructions
> every cycle. It enters an accelerated mode if the windows have the
> right combination of instructions or alignments. Our goal is to
> maximize IPC through proper instruction scheduling and alignment.
>
> Here is a summary of the most important requirements:
>
> a) Maximum of N instructions per window.
> b) An instruction may cross the first window.
> c) Each window can have a maximum of x memory loads and y memory
>    stores.
> d) The total number of immediate constants in the instructions
>    of a window should not exceed k.
> e) The first window must be aligned on a 16-byte boundary.
> f) A window set terminates when a window contains a branch.
> g) The number of allowed prefixes varies by instruction.
> h) A window set needs to be padded by prefixes in instructions
>    or terminated by nops to ensure adherence to the rules.
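As a reading aid, rules a), c), d) and f) above can be sketched as a per-window admission check. This is a minimal Python sketch, not the proposed GCC code: the names `Limits`, `Insn` and `Window` are hypothetical, and the numeric limits are placeholders because the post leaves N, x, y and k unspecified.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Limits:
    # Placeholder values only: the post does not give N, x, y or k.
    max_insns: int = 4   # rule a
    max_loads: int = 2   # rule c (x)
    max_stores: int = 1  # rule c (y)
    max_imms: int = 4    # rule d (k)


@dataclass(frozen=True)
class Insn:
    name: str
    loads: int = 0
    stores: int = 0
    imms: int = 0
    is_branch: bool = False


@dataclass
class Window:
    insns: list = field(default_factory=list)

    def closed(self):
        # Rule f: a branch in the window terminates the window set.
        return any(i.is_branch for i in self.insns)

    def fits(self, insn, limits):
        # Would adding insn keep the window within rules a, c and d?
        if self.closed():
            return False
        trial = self.insns + [insn]
        return (len(trial) <= limits.max_insns
                and sum(i.loads for i in trial) <= limits.max_loads
                and sum(i.stores for i in trial) <= limits.max_stores
                and sum(i.imms for i in trial) <= limits.max_imms)
```

Rules b), e), g) and h) concern byte positions and padding, which this check deliberately ignores.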
>
> We have the following implementation plan for GCC:
>
> 1) Modify the Haifa scheduler to make the desired arrangement of
>    instructions for the two dispatch windows. The scheduler is called
>    once before and once after register allocation as usual. In both
>    cases it performs dispatch scheduling along with its normal job of
>    instruction scheduling.
>
> The advantage of doing it before register allocation is avoiding
> extra dependencies caused by register allocation, which may become
> an obstacle to the movement of instructions.  The advantage of doing
> it after register allocation is that it accounts for spill code
> that the register allocator may generate.
>
> The algorithm we use is:
>
> a) Considering the current dispatch window set, choose the first
>    instruction from the ready queue that does not violate the
>    dispatch rules.
> b) When an instruction is selected and scheduled, inform the
>    dispatcher code about the instruction. This step keeps track of the
>    instruction content of the windows for future evaluation. It also
>    manages the window set by closing windows and opening new virtual
>    dispatch windows.
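The two steps above amount to a greedy loop over the ready queue. A minimal sketch, assuming a caller-supplied `fits` predicate that encodes the dispatch rules; `pick_next`, `schedule` and `toy_fits` are hypothetical names, and the dependency tracking done by the real Haifa scheduler is left out.

```python
def pick_next(ready, window, fits):
    """Step a: index of the first ready instruction the window accepts."""
    for idx, insn in enumerate(ready):
        if fits(window, insn):
            return idx
    return None


def schedule(ready, fits):
    """Greedily pack instructions into successive virtual windows."""
    ready = list(ready)
    windows = [[]]
    while ready:
        idx = pick_next(ready, windows[-1], fits)
        if idx is None:
            if not windows[-1]:
                raise ValueError("an instruction fits in no window")
            windows.append([])  # step b: close this window, open a new one
            continue
        windows[-1].append(ready.pop(idx))  # step b: record the choice
    return windows


def toy_fits(window, insn):
    # Toy rule set: at most 3 instructions and 1 load ("ld ...") per window.
    loads = sum(1 for i in window if i.startswith("ld"))
    return len(window) < 3 and (not insn.startswith("ld") or loads < 1)


print(schedule(["ld a", "ld b", "add", "sub", "ld c"], toy_fits))
```

With the toy rules, the second load is skipped in favor of `add` and `sub`, then placed in the next window.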
>
> 2) Insertion of alignment code.
>
> On x86, alignment is done by inserting prefixes or by generating
> nops. Because GCC leaves object code generation to the assembler,
> some information, such as branch sizes, is unknown until assembly or
> link time. To do dispatch-related alignment correctly in GCC, we
> would need to iteratively compute prefixes and branch sizes until
> they converge. No such pass currently exists in GCC, but one exists
> in the assembler.
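The convergence loop described above resembles classic branch relaxation. A minimal sketch under simplifying assumptions: only an unconditional `jmp` is modeled (2 bytes with an 8-bit displacement, 5 bytes with a 32-bit one), branches only ever grow, and the item format and function names are hypothetical rather than taken from GCC or gas.

```python
SHORT, NEAR = 2, 5  # byte sizes of x86 "jmp rel8" and "jmp rel32"


def layout(items, sizes):
    """Assign an address to every item using the current branch sizes."""
    addr, addrs, labels = 0, [], {}
    for i, (kind, arg) in enumerate(items):
        if kind == "align":
            addr += -addr % arg  # pad up to the next multiple of arg
        addrs.append(addr)
        if kind == "label":
            labels[arg] = addr
        elif kind == "insn":
            addr += arg          # arg is the instruction length in bytes
        elif kind == "jmp":
            addr += sizes[i]
    return addrs, labels


def relax(items):
    """Start with short branches and grow them until nothing changes."""
    sizes = {i: SHORT for i, (kind, _) in enumerate(items) if kind == "jmp"}
    while True:
        addrs, labels = layout(items, sizes)
        changed = False
        for i, size in list(sizes.items()):
            # Displacement is measured from the end of the branch.
            disp = labels[items[i][1]] - (addrs[i] + size)
            if size == SHORT and not -128 <= disp <= 127:
                sizes[i] = NEAR
                changed = True
        if not changed:
            return sizes, labels
```

Because a branch never shrinks in this sketch, the loop ends after at most one pass per branch. Padding with prefixes would add another size that depends on the layout, which is why the post describes the computation as iterative.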
>
> There are two possible approaches to solving the alignment problem:
>
> a)  Let the assembler perform the alignment and padding needed
>     to adhere to the new machine dispatching rules, avoiding an extra
>     pass in GCC.
> b)  Add a new pass to GCC that mimics what the assembler does and
>     inserts the required alignments before the assembly listing is
>     generated.
>
> I would appreciate your comments on the proposed implementation
> procedure and on choices a and b above.
>
> Reza Yazdani

Re: Scheduling x86 dispatch windows

H.J. Lu
On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
<[hidden email]> wrote:

>> There are two possible approaches to solving the alignment problem:
>>
>> a)  Let the assembler perform the alignment and padding needed
>>     to adhere to the new machine dispatching rules, avoiding an extra
>>     pass in GCC.
>> b)  Add a new pass to GCC that mimics what the assembler does and
>>     inserts the required alignments before the assembly listing is
>>     generated.
>>
>> I would appreciate your comments on the proposed implementation
>> procedure and on choices a and b above.

I don't think this should be done in the assembler. The assembler should
just assemble the assembly input.

--
H.J.

Re: Scheduling x86 dispatch windows

Jeff Law
On 06/10/10 13:52, H.J. Lu wrote:

> I don't think this should be done in the assembler. The assembler should
> just assemble the assembly input.
>
That adds quite a bit of complication to the compiler though -- getting
the instruction lengths right (and thus proper packing & alignment) can
be extremely difficult.  I did some experiments with this on a target
with *fixed* instruction lengths a while back and even though the port
tried hard to get lengths right, it would routinely miss something.
Ultimately I decided that forcing the compiler to know instruction
lengths with a very high degree of accuracy wasn't a sane thing to
do.  Dealing with variable instruction lengths just adds yet another
complexity to the situation.  Then add the complication of needing to
add specific prefixes or nops and it just gets downright ugly.

I'd probably approach this by having the compiler emit a directive which
states what the desired alignment at a particular point should be, then
allow the assembler to select the best method to get the desired alignment.
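To make the directive idea concrete, here is a minimal sketch of a compiler-side emitter that states the desired alignment and leaves the choice of padding to the assembler. It uses the existing gas directive `.p2align` (where `.p2align 4` requests 16-byte alignment) purely as a stand-in; the `emit` function and the list-of-window-sets input are hypothetical, and a dispatch-aware directive would need different semantics.

```python
def emit(window_sets, log2_align=4):
    """Return assembly text with an alignment request before each window set."""
    lines = []
    for insns in window_sets:
        # State the wanted alignment; the assembler decides how to pad.
        lines.append(f"\t.p2align {log2_align}")
        lines.extend(f"\t{insn}" for insn in insns)
    return "\n".join(lines) + "\n"


print(emit([["movl %eax, %ebx", "addl $1, %ebx"], ["ret"]]))
```

Aligning before every window set this way would cost code size, so a real scheme would presumably be more selective.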

jeff




Re: Scheduling x86 dispatch windows

Joern Rennecke
Quoting Jeff Law <[hidden email]>:

> That adds quite a bit of complication to the compiler though -- getting
> the instruction lengths right (and thus proper packing & alignment) can
> be extremely difficult.  I did some experiments with this on a target
> with *fixed* instruction lengths a while back and even though the port
> tried hard to get lengths right, it would routinely miss something.
> Ultimately I decided that forcing the compiler to know instruction
> lengths with a very high degree of accuracy wasn't a sane thing to do.
> Dealing with variable instruction lengths just adds yet another
> complexity to the situation.  Then add the complication of needing to
> add specific prefixes or nops and it just gets downright ugly.

I did add alignment-aware, exact branch shortening to the ARCompact port,
but ultimately the added complexity was also one reason the port couldn't
go into mainline without an active maintainer.
The code is available on branches.
See PR target/39303.

Re: Scheduling x86 dispatch windows

Quentin Neill
On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <[hidden email]> wrote:

> That adds quite a bit of complication to the compiler though -- getting the
> instruction lengths right (and thus proper packing & alignment) can be
> extremely difficult.  I did some experiments with this on a target with
> *fixed* instruction lengths a while back and even though the port tried hard
> to get lengths right, it would routinely miss something.  Ultimately I
> decided that forcing the compiler to know instruction lengths with a very
> high degree of accuracy wasn't a sane thing to do.  Dealing with variable
> instruction lengths just adds yet another complexity to the situation.  Then
> add the complication of needing to add specific prefixes or nops and it just
> gets downright ugly.
>
> I'd probably approach this by having the compiler emit a directive which
> states what the desired alignment at a particular point should be, then
> allow the assembler to select the best method to get the desired alignment.

Jeff,

This is exactly part of the binutils side of our proposal, which I'll
outline now:

1. Allow multiple prefixes for ADDR and DS (and possibly others)
   a) multiple prefixes are benign in certain modes and are thus chosen
      for padding
   b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
      more explicit (and they don't trigger a call to
      md_flush_pending_output)

2. Add a new pseudo-op to delineate alignment boundaries.  This is
   needed to signal any dispatch engine (below) to pad.  Here are my top
   two candidates; any feedback is appreciated:
   a) ".flush", a new pseudo-op plumbed directly to
      "md_flush_pending_output()"
   b) ".padalign", which calls a new "md_pad_align()"

3. Add dispatch optimization infrastructure which
   a) is guarded by an -mtune flag (and possibly other -f style flags)
   b) tracks assembled instruction attributes and their fragments
   c) can pad (insert benign prefixes) into previously assembled fragments
   d) maintains dispatch engine state (according to some subset of Reza's
      rules)

Discussion:

The flags in 3a) should guard against these changes affecting current behavior.

The assembly tracking in 3b) is for bookkeeping only; the padding in
3c) would only occur when a compiler uses the pseudo-op in 2) or when
the dispatch engine in 3d) signals.
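The padding step in 3c) can be pictured as spreading redundant prefix bytes over already-encoded instructions until the next boundary is reached. This is a minimal sketch, not the planned gas code: `pad_window` and its `room` argument (how many extra prefixes each instruction may take, per Reza's rule g) are hypothetical, the DS segment-override byte 0x3E is used as the benign prefix, and any remainder is filled with one-byte nops.

```python
DS_PREFIX = 0x3E  # DS segment-override prefix byte
NOP = 0x90        # one-byte nop


def pad_window(insns, room, start=0, align=16):
    """Pad encoded instructions so the sequence ends on an align boundary.

    insns: list of bytes objects, one per encoded instruction
    room:  extra prefixes each instruction can absorb (assumed input)
    start: address of the first instruction
    """
    total = sum(len(enc) for enc in insns)
    need = -(start + total) % align  # bytes missing to the next boundary
    out = []
    for enc, cap in zip(insns, room):
        take = min(cap, need)
        out.append(bytes([DS_PREFIX]) * take + enc)
        need -= take
    out.append(bytes([NOP]) * need)  # fall back to nops for the rest
    return b"".join(out)


padded = pad_window([b"\x89\xd8", b"\x01\xc3", b"\xc3"], room=[3, 3, 0])
print(padded.hex())
```

Here 5 bytes of code need 11 bytes of padding; 6 are absorbed as prefixes and 5 are emitted as trailing nops.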

For compilers that know exactly how to pad for the new processor, the
ability to pad explicitly using 1), 2), and .align/.balign/.p2align
should be enough.

For assembly programs and/or compilers that don't choose to do any
dispatch optimization, it's anticipated that the engine in 3d) would
be useful for optimizing for -mtune=bdver1.

I'll post patches for these soon.
--
Quentin Neill

Re: Scheduling x86 dispatch windows

H.J. Lu
On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill
<[hidden email]> wrote:

> I'll post patches for these soon.

Can you do it with directives only?


--
H.J.

Re: Scheduling x86 dispatch windows

Quentin Neill
On Thu, Jun 10, 2010 at 4:08 PM, H.J. Lu <[hidden email]> wrote:

> On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill
> <[hidden email]> wrote:
>> On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <[hidden email]> wrote:
>>> On 06/10/10 13:52, H.J. Lu wrote:
>>>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>>>> <[hidden email]>  wrote:
>>>>> Cross-posting Reza's call for feedback to the binutils list since it
>>>>> is relevant - s ee the last few paragraphs regarding how to
>>>>> "solve the alignment problem".
>>>>>
>>>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>>>
>>>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<[hidden email]>
>>>>>  wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We are in the process of adding a feature to GCC to take advantage
>>>>>> of a new hardware feature in the latest AMD micro processor. This
>>>>>> feature requires a certain mix, ordering and alignments in
>>>>>> instruction sequences to obtain the expected hardware performance.
>>>>>>
>>>>>> I am asking the community to review this high level implementation
>>>>>> design and give me direction or advice.
>>>>>>
>>>>>> The new hardware issues two windows of the size N bytes of
>>>>>> instructions in every cycle. It goes into accelerate mode if the
>>>>>> windows have the right combination of instructions or alignments. Our
>>>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>>>> alignments.
>>>>>>
>>>>>> Here is a summary of the most important requirements:
>>>>>>
>>>>>> a) Maximum of N instructions per window.
>>>>>> b) An instruction may cross the first window.
>>>>>> c) Each window can have maximum of x memory loads and y memory
>>>>>>    stores .
>>>>>> d) The total number of immediate constants in the instructions
>>>>>>    of a window should not exceed k.
>>>>>> e) The first window must be aligned on 16 byte boundary.
>>>>>> f) A Window set terminates when a branch exists in a window.
>>>>>> g) The number of allowed prefixes varies for instructions.
>>>>>> h) A window set needs to be padded by prefixes in instructions
>>>>>>    or terminated by nops to ensure adherence to the rules.
>>>>>>
>>>>>> We have the following implementation plan for GCC:
>>>>>>
>>>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>>>    instructions for the two dispatch windows. The scheduler is called
>>>>>>    once before and once after register allocation as usual. In both
>>>>>>    cases it performs dispatch scheduling along with its normal job of
>>>>>>    instruction scheduling.
>>>>>>
>>>>>> The advantage of doing it before register allocation is avoiding
>>>>>> extra dependencies caused by register allocation which may become
>>>>>> an obstacle to movement of instructions.  The advantage of doing
>>>>>> it after register allocation is a consideration for spilling code
>>>>>> which may be generated by the register allocator.
>>>>>>
>>>>>> The algorithm we use is:
>>>>>>
>>>>>> a) Considering the current dispatch window set, choose the first
>>>>>>    instruction from ready queue that does not violate dispatch rules.
>>>>>> b) When an instruction is selected and scheduled, inform the
>>>>>>    dispatcher code about the instruction. This step keeps track of the
>>>>>>    instruction content of windows for future evaluation. It also manages
>>>>>>    the window set by closing and opening new virtual dispatch windows.
>>>>>>
>>>>>> 2) Insertion of alignment code.
>>>>>>
>>>>>> In x86 alignment is done by inserting prefixes or by generating
>>>>>> nops. As the object code is generated by the assembler in GCC, some
>>>>>> information such as sizes of branches are unknown until assembly or
>>>>>> link time. To do alignments related to dispatch correctly in GCC,
>>>>>> we need to iteratively compute prefixes and branch sizes until
>>>>>> its convergence. This pass currently does not exist in GCC, but it
>>>>>> exists in the assembler.
>>>>>>
>>>>>> There are two possible approaches to solve the alignment problem.
>>>>>>
>>>>>> a)  Let the assembler perform the alignments and padding needed
>>>>>>     to adhere to the new machine dispatching rules and avoid an extra
>>>>>>     pass in GCC.
>>>>>> b)  Add a new pass to mimic what the assembler does before generating
>>>>>>     the assembly listing in GCC and insert the required alignments.
>>>>>>
>>>>>> I appreciate your comments on the proposed implementation procedure
>>>>>> and the choices a or b above.
>>>>>>
>>>>
>>>> I don't think this should be done in assembler. Assembler should just assemble
>>>> the assembly input.
>>>
>>> That adds quite a bit of complication to the compiler though -- getting the
>>> instruction lengths right (and thus proper packing & alignment) can be
>>> extremely difficult.  I did some experiments with this on a target with
>>> *fixed* instruction lengths a while back and even though the port tried hard
>>> to get lengths right, it would routinely miss something.  Ultimately I
>>> decided that forcing the compiler to know instruction lengths with a very
>>> high degree of accuracy wasn't a sane thing to do.    Dealing with variable
>>> instruction lengths just adds yet another complexity to the situation.  Then
>>> add the complication of needing to add specific prefixes or nops and it just
>>> gets downright ugly.
>>>
>>> I'd probably approach this by having the compiler emit a directive which
>>> states what the desired alignment at a particular point should be, then
>>> allow the assembler to select the best method to get the desired alignment.
>>
>> Jeff,
>>
>> This is exactly part of our binutils side of the proposal, which I'll
>> outline now
>>
>> 1. Allow multiple prefixes for ADDR and DS (and possibly others)
>> a) multiple prefixes are benign in certain modes and are thus chosen for padding
>> b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
>> more explicit (and they don't trigger a call to
>> md_flush_pending_output)
>>
>> 2. Add new pseudo-op to delineate alignment boundaries.  This is
>> needed to signal any dispatch engine (below) to pad.  Here are my top
>> two candidates, any feedback is appreciated:
>> a) ".flush" new pseudo-op plumbed directly to "md_flush_pending_output()"
>> b) ".padalign" which calls a new "md_pad_align()"
>>
>> 3. Add dispatch optimization infrastructure which
>> a) is guarded by -mtune flag (and possibly other -f style flags)
>> b) tracks assembled instruction attributes and their fragments
>> c) can pad (insert benign prefixes) into previously assembled fragments
>> d) maintains dispatch engine state (according to some subset of Reza's rules)
>>
>> Discussion:
>>
>> The flags in 3a) should guard against these changes affecting current behavior.
>>
>> The assembly tracking in 3b) is for bookkeeping only; the padding in
>> 3c) would only occur when a compiler uses the pseudo-op in 2) or when
>> the dispatch engine in 3d) signals.
>>
>> For compilers that know exactly how to pad for the new processor, the
>> ability to pad explicitly using 1), 2), and .align/.balign/.p2align
>> should be enough.
>>
>> For assembly programs and/or compilers that don't choose to do any
>> dispatch optimization, it's anticipated that the engine in 3d) would
>> be useful for optimizing for -mtune=bdver1
>>
>> I'll post patches for these soon.
>
> Can you do it with directives only?

In theory, if the compiler knows all sizes and offsets, yes (given
some way to add multiple prefixes).

However in practice, no.

To get GCC to know everything would require replicating most assembler
functionality in GCC, including parsing, assembling, and sizing
(parts of output_insn() and its child output_*() functions).  We
considered exposing one-line assembly as a library, but you have to
provide (or reuse) the segment/frchain/fragment context, and I don't
think a GCC->binutils dependency (other than at runtime) would be
easy to introduce into the community.

This wouldn't cover the assembly language case either.

And remember, even if you have all the directives (and the
programmer/compiler knows all), the assembler must remember potential
padding locations until the decision (and knowledge about how) to pad
arrives.
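
To make that last point concrete, here is a minimal sketch (Python, not
gas code) of the bookkeeping: each assembled instruction is remembered as
a potential padding location, and prefix bytes are only distributed once a
pad request arrives. The Insn and PadTracker names, the 8-byte window and
the per-instruction max_prefixes limit are all assumptions made for
illustration; the sketch also assumes the pending run starts on a window
boundary.

```python
from dataclasses import dataclass, field


@dataclass
class Insn:
    name: str
    length: int        # encoded length in bytes, before any padding
    max_prefixes: int  # hypothetical: benign prefixes this insn may take
    added: int = 0     # prefixes added so far


@dataclass
class PadTracker:
    window: int = 8    # hypothetical window size in bytes
    pending: list = field(default_factory=list)

    def emit(self, insn):
        """Remember an assembled instruction as a potential padding location."""
        self.pending.append(insn)

    def pad_align(self):
        """Pad the pending instructions up to the next window boundary.

        Prefixes are handed out in program order. The return value is the
        number of bytes still missing, which would have to be filled with nops.
        """
        used = sum(i.length + i.added for i in self.pending)
        need = -used % self.window
        for insn in self.pending:
            take = min(need, insn.max_prefixes - insn.added)
            insn.added += take
            need -= take
        self.pending = []
        return need
```

With a 3-byte instruction that may take two prefixes followed by a 3-byte
instruction that may take none, pad_align() adds both prefixes to the
first instruction and reports that no nops are needed.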

--
Quentin Neill

Re: Scheduling x86 dispatch windows

H.J. Lu-30
On Thu, Jun 10, 2010 at 3:09 PM, Quentin Neill
<[hidden email]> wrote:

> On Thu, Jun 10, 2010 at 4:08 PM, H.J. Lu <[hidden email]> wrote:
>> On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill
>> <[hidden email]> wrote:
>>> On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <[hidden email]> wrote:
>>>> On 06/10/10 13:52, H.J. Lu wrote:
>>>>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>>>>> <[hidden email]>  wrote:
>>>>>> Cross-posting Reza's call for feedback to the binutils list since it
>>>>>> is relevant - see the last few paragraphs regarding how to
>>>>>> "solve the alignment problem".
>>>>>>
>>>>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>>>>
>>>>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<[hidden email]>
>>>>>>  wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are in the process of adding a feature to GCC to take advantage
>>>>>>> of a new hardware feature in the latest AMD micro processor. This
>>>>>>> feature requires a certain mix, ordering and alignments in
>>>>>>> instruction sequences to obtain the expected hardware performance.
>>>>>>>
>>>>>>> I am asking the community to review this high level implementation
>>>>>>> design and give me direction or advice.
>>>>>>>
>>>>>>> The new hardware issues two windows of the size N bytes of
>>>>>>> instructions in every cycle. It goes into accelerate mode if the
>>>>>>> windows have the right combination of instructions or alignments. Our
>>>>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>>>>> alignments.
>>>>>>>
>>>>>>> Here is a summary of the most important requirements:
>>>>>>>
>>>>>>> a) Maximum of N instructions per window.
>>>>>>> b) An instruction may cross the first window.
>>>>>>> c) Each window can have maximum of x memory loads and y memory
>>>>>>>    stores .
>>>>>>> d) The total number of immediate constants in the instructions
>>>>>>>    of a window should not exceed k.
>>>>>>> e) The first window must be aligned on 16 byte boundary.
>>>>>>> f) A Window set terminates when a branch exists in a window.
>>>>>>> g) The number of allowed prefixes varies for instructions.
>>>>>>> h) A window set needs to be padded by prefixes in instructions
>>>>>>>    or terminated by nops to ensure adherence to the rules.
>>>>>>>
>>>>>>> We have the following implementation plan for GCC:
>>>>>>>
>>>>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>>>>    instructions for the two dispatch windows. The scheduler is called
>>>>>>>    once before and once after register allocation as usual. In both
>>>>>>>    cases it performs dispatch scheduling along with its normal job of
>>>>>>>    instruction scheduling.
>>>>>>>
>>>>>>> The advantage of doing it before register allocation is avoiding
>>>>>>> extra dependencies caused by register allocation which may become
>>>>>>> an obstacle to movement of instructions.  The advantage of doing
>>>>>>> it after register allocation is a consideration for spilling code
>>>>>>> which may be generated by the register allocator.
>>>>>>>
>>>>>>> The algorithm we use is:
>>>>>>>
>>>>>>> a) Considering the current dispatch window set, choose the first
>>>>>>>    instruction from ready queue that does not violate dispatch rules.
>>>>>>> b) When an instruction is selected and scheduled, inform the
>>>>>>>    dispatcher code about the instruction. This step keeps track of the
>>>>>>>    instruction content of windows for future evaluation. It also manages
>>>>>>>    the window set by closing and opening new virtual dispatch windows.
>>>>>>>
>>>>>>> 2) Insertion of alignment code.
>>>>>>>
>>>>>>> In x86 alignment is done by inserting prefixes or by generating
>>>>>>> nops. As the object code is generated by the assembler in GCC, some
>>>>>>> information such as sizes of branches are unknown until assembly or
>>>>>>> link time. To do alignments related to dispatch correctly in GCC,
>>>>>>> we need to iteratively compute prefixes and branch sizes until
>>>>>>> its convergence. This pass currently does not exist in GCC, but it
>>>>>>> exists in the assembler.
>>>>>>>
>>>>>>> There are two possible approaches to solve the alignment problem.
>>>>>>>
>>>>>>> a)  Let the assembler perform the alignments and padding needed
>>>>>>>     to adhere to the new machine dispatching rules and avoid an extra
>>>>>>>     pass in GCC.
>>>>>>> b)  Add a new pass to mimic what the assembler does before generating
>>>>>>>     the assembly listing in GCC and insert the required alignments.
>>>>>>>
>>>>>>> I appreciate your comments on the proposed implementation procedure
>>>>>>> and the choices a or b above.
>>>>>>>
>>>>>
>>>>> I don't think this should be done in assembler. Assembler should just assemble
>>>>> the assembly input.
>>>>
>>>> That adds quite a bit of complication to the compiler though -- getting the
>>>> instruction lengths right (and thus proper packing & alignment) can be
>>>> extremely difficult.  I did some experiments with this on a target with
>>>> *fixed* instruction lengths a while back and even though the port tried hard
>>>> to get lengths right, it would routinely miss something.  Ultimately I
>>>> decided that forcing the compiler to know instruction lengths with a very
>>>> high degree of accuracy wasn't a sane thing to do.    Dealing with variable
>>>> instruction lengths just adds yet another complexity to the situation.  Then
>>>> add the complication of needing to add specific prefixes or nops and it just
>>>> gets downright ugly.
>>>>
>>>> I'd probably approach this by having the compiler emit a directive which
>>>> states what the desired alignment at a particular point should be, then
>>>> allow the assembler to select the best method to get the desired alignment.
>>>
>>> Jeff,
>>>
>>> This is exactly part of our binutils side of the proposal, which I'll
>>> outline now
>>>
>>> 1. Allow multiple prefixes for ADDR and DS (and possibly others)
>>> a) multiple prefixes are benign in certain modes and are thus chosen for padding
>>> b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
>>> more explicit (and they don't trigger a call to
>>> md_flush_pending_output)
>>>
>>> 2. Add new pseudo-op to delineate alignment boundaries.  This is
>>> needed to signal any dispatch engine (below) to pad.  Here are my top
>>> two candidates, any feedback is appreciated:
>>> a) ".flush" new pseudo-op plumbed directly to "md_flush_pending_output()"
>>> b) ".padalign" which calls a new "md_pad_align()"
>>>
>>> 3. Add dispatch optimization infrastructure which
>>> a) is guarded by -mtune flag (and possibly other -f style flags)
>>> b) tracks assembled instruction attributes and their fragments
>>> c) can pad (insert benign prefixes) into previously assembled fragments
>>> d) maintains dispatch engine state (according to some subset of Reza's rules)
>>>
>>> Discussion:
>>>
>>> The flags in 3a) should guard against these changes affecting current behavior.
>>>
>>> The assembly tracking in 3b) is for bookkeeping only; the padding in
>>> 3c) would only occur when a compiler uses the pseudo-op in 2) or when
>>> the dispatch engine in 3d) signals.
>>>
>>> For compilers that know exactly how to pad for the new processor, the
>>> ability to pad explicitly using 1), 2), and .align/.balign/.p2align
>>> should be enough.
>>>
>>> For assembly programs and/or compilers that don't choose to do any
>>> dispatch optimization, it's anticipated that the engine in 3d) would
>>> be useful for optimizing for -mtune=bdver1
>>>
>>> I'll post patches for these soon.
>>
>> Can you do it with directives only?
>
> In theory, if the compiler knows all sizes and offsets, yes (given
> some way to add multiple prefixes).
>
> However in practice, no.
>
> To get GCC to know everything would require replicating most assembler
> functionality in GCC, including parsing, assembling, and sizing
> (parts of output_insn() and its child output_*() functions).  We
> considered exposing one-line assembly as a library, but you have to
> provide (or reuse) the segment/frchain/fragment context, and I don't
> think a GCC->binutils dependency (other than at runtime) would be
> easy to introduce into the community.
>
> This wouldn't cover the assembly language case either.
>
> And remember, even if you have all the directives (and the
> programmer/compiler knows all), the assembler must remember potential
> padding locations until the decision (and knowledge about how) to pad
> arrives.
>

x86 assembler isn't an optimizing assembler. -mtune only does
instruction selection.  What you are proposing sounds like an optimizing
assembler to me. Are we going to support scheduling, macro, ...?


--
H.J.

Re: Scheduling x86 dispatch windows

Daniel Jacobowitz-3
In reply to this post by Jeff Law
On Thu, Jun 10, 2010 at 02:03:03PM -0600, Jeff Law wrote:
> That adds quite a bit of complication to the compiler though --
> getting the instruction lengths right (and thus proper packing &
> alignment) can be extremely difficult.  I did some experiments with
> this on a target with *fixed* instruction lengths a while back and
> even though the port tried hard to get lengths right, it would
> routinely miss something.  Ultimately I decided that forcing the
> compiler to know instruction lengths with a very high degree of
> accuracy wasn't a sane thing to do.

FWIW, my opinion (and I think Jakub has expressed a similar opinion
and/or tool in the past) is that there is a sane way to do this: put
assertions in the assembler output and have the assembler validate
them.

On the other hand, I'm not going to argue that it's a lot of work.

--
Daniel Jacobowitz
CodeSourcery

Re: Scheduling x86 dispatch windows

Quentin Neill
On Thu, Jun 10, 2010 at 5:40 PM, Daniel Jacobowitz <[hidden email]> wrote:

> On Thu, Jun 10, 2010 at 02:03:03PM -0600, Jeff Law wrote:
>> That adds quite a bit of complication to the compiler though --
>> getting the instruction lengths right (and thus proper packing &
>> alignment) can be extremely difficult.  I did some experiments with
>> this on a target with *fixed* instruction lengths a while back and
>> even though the port tried hard to get lengths right, it would
>> routinely miss something.  Ultimately I decided that forcing the
>> compiler to know instruction lengths with a very high degree of
>> accuracy wasn't a sane thing to do.
>
> FWIW, my opinion (and I think Jakub has expressed a similar opinion
> and/or tool in the past) is that there is a sane way to do this: put
> assertions in the assembler output and have the assembler validate
> them.
>
> On the other hand, I'm not going to argue that it's a lot of work.
> --
> Daniel Jacobowitz
> CodeSourcery

When you say "put assertions in the assembler output" I understood it
to mean "in the assembly source code output by the compiler", not "the
output produced by the assembler".

Does this qualify as a form of what you are suggesting?  Because this
is exactly what is being proposed:

.balign 8                  # start window
    insn op, op          # 67 67 XX YY ZZ  - padded with 2 prefixes to make 8
    insn2 op, op        # AA BB CC
.padalign 8              # window boundary
    insn4 op
    . . .

--
Quentin Neill

Re: Scheduling x86 dispatch windows

Daniel Jacobowitz-3
On Thu, Jun 10, 2010 at 09:48:24PM -0500, Quentin Neill wrote:
> > On the other hand, I'm not going to argue that it's a lot of work.

Missing "not" !

> When you say "put assertions in the assembler output" I understood it
> to mean "in the assembly source code output by the compiler", not "the
> output produced by the assembler".

Yes.

> Does this qualify as a form of what you are suggesting?  Because this
> is exactly what is being proposed:
>
> .balign 8                  # start window
>     insn op, op          # 67 67 XX YY ZZ  - padded with 2 prefixes to make 8
>     insn2 op, op        # AA BB CC
> .padalign 8              # window boundary
>     insn4 op
>     . . .

No, this is quite different.  These are directives that tell the
assembler to make changes.  I'm talking about assertions, not
directives.  Something like this:

  mov r0, r1 @ [length 2]
  add ip, lr, ip @ [length 4]
  mov r0, r1 @ [length 4] <-- assembler error 'insn has length 2'

GCC can output length information, but it is never exact, and it is
not in a form recognized by the assembler.

On x86, I have no idea how this would work.
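
As a rough illustration of the assertion idea (a sketch only, not an
existing gas feature): a checker reads the "@ [length N]" annotations and
compares each against the length the assembler actually produced. The
ACTUAL_LENGTH table is a hypothetical stand-in for the assembler's own
encoder, and check_lengths is an invented name.

```python
import re

# Hypothetical table of true encoded lengths; a real assembler would take
# these from its own encoder rather than from a lookup table.
ACTUAL_LENGTH = {
    "mov r0, r1": 2,
    "add ip, lr, ip": 4,
}

# Matches "<insn> @ [length N]" as in the listing above.
ASSERTION = re.compile(r"^\s*(?P<insn>[^@]+?)\s*@\s*\[length\s+(?P<n>\d+)\]")


def check_lengths(source):
    """Return (line number, claimed, actual) for every failed assertion."""
    errors = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = ASSERTION.match(line)
        if m is None:
            continue  # no assertion on this line, nothing to validate
        claimed = int(m.group("n"))
        # Raises KeyError for instructions missing from the sketch's table.
        actual = ACTUAL_LENGTH[m.group("insn")]
        if claimed != actual:
            errors.append((lineno, claimed, actual))
    return errors
```

Run on the three lines above, this flags only the third: it claims
length 4 while the instruction has length 2. Nothing in the input is
changed, which is the difference from a directive.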

--
Daniel Jacobowitz
CodeSourcery

Re: Scheduling x86 dispatch windows

Quentin Neill
On Fri, Jun 11, 2010 at 10:58 AM, Daniel Jacobowitz
<[hidden email]> wrote:
> On Thu, Jun 10, 2010 at 09:48:24PM -0500, Quentin Neill wrote:
[snip]

>> Does this qualify as a form of what you are suggesting?  Because this
>> is exactly what is being proposed:
>>
>> .balign 8                  # start window
>>     insn op, op          # 67 67 XX YY ZZ  - padded with 2 prefixes to make 8
>>     insn2 op, op        # AA BB CC
>> .padalign 8              # window boundary
>>     insn4 op
>>     . . .
>
> No, this is quite different.  These are directives that tell the
> assembler to make changes.  I'm talking about assertions, not
> directives.  Something like this:
>
>  mov r0, r1 @ [length 2]
>  add ip, lr, ip @ [length 4]
>  mov r0, r1 @ [length 4] <-- assembler error 'insn has length 2'
>
> GCC can output length information, but it is never exact, and it is
> not in a form recognized by the assembler.
>
> On x86, I have no idea how this would work.
>
> --
> Daniel Jacobowitz
> CodeSourcery
>

I see.

Currently GCC doesn't compute the current encoding offset (doesn't
know mnemonic/opcode lengths), nor does it perform a relaxation pass
(to resolve forward displacement/branch offsets).   Without these it
cannot accurately formulate such assertions.
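
For readers unfamiliar with relaxation, a minimal sketch of such a pass
follows. It is not gas's or GCC's code: the item format and the relax
name are assumptions, and only unconditional x86 jumps are modelled
(2 bytes with an 8-bit displacement, 5 bytes with a 32-bit one). Every
jump starts short and is grown whenever its displacement, measured from
the end of the jump, does not fit in a signed byte; the layout is
recomputed until nothing changes.

```python
SHORT, LONG = 2, 5  # x86 jmp rel8 / jmp rel32 sizes in bytes


def relax(items):
    """Choose a size for every jump in items.

    items is a list of ("insn", length), ("label", name) or ("jmp", target).
    Returns the jump sizes in program order.
    """
    sizes = {i: SHORT for i, it in enumerate(items) if it[0] == "jmp"}
    while True:
        # Lay out the code using the current size guesses.
        offset, start, labels = 0, {}, {}
        for i, (kind, arg) in enumerate(items):
            start[i] = offset
            if kind == "label":
                labels[arg] = offset
            elif kind == "jmp":
                offset += sizes[i]
            else:
                offset += arg
        # Grow jumps that are out of range. Sizes never shrink, so the
        # loop is guaranteed to terminate.
        changed = False
        for i, size in list(sizes.items()):
            disp = labels[items[i][1]] - (start[i] + size)
            if size == SHORT and not -128 <= disp <= 127:
                sizes[i] = LONG
                changed = True
        if not changed:
            return [sizes[i] for i in sorted(sizes)]
```

The iteration matters because growing one jump moves everything after it,
which can push another jump out of range. Any padding inserted for
dispatch windows shifts offsets in the same way, so it has to take part
in the same loop.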

Our proposal is to let the assembler itself (knowing best the details
of the encoding stream, offsets, and the processor) align
instructions, with hints about the structure (block starts, ends,
instruction sets) using macros/assertions/tokens if needed.

Another option would be to expose some subset of the assembler
functionality as a plugin to GCC (similar to how gold is used) to
extract the instruction sizes.   Any comments on that approach?

--
Quentin Neill

Re: Scheduling x86 dispatch windows

H.J. Lu-30
On Fri, Jun 11, 2010 at 12:09 PM, Quentin Neill
<[hidden email]> wrote:

> On Fri, Jun 11, 2010 at 10:58 AM, Daniel Jacobowitz
> <[hidden email]> wrote:
>> On Thu, Jun 10, 2010 at 09:48:24PM -0500, Quentin Neill wrote:
> [snip]
>>> Does this qualify as a form of what you are suggesting?  Because this
>>> is exactly what is being proposed:
>>>
>>> .balign 8                  # start window
>>>     insn op, op          # 67 67 XX YY ZZ  - padded with 2 prefixes to make 8
>>>     insn2 op, op        # AA BB CC
>>> .padalign 8              # window boundary
>>>     insn4 op
>>>     . . .
>>
>> No, this is quite different.  These are directives that tell the
>> assembler to make changes.  I'm talking about assertions, not
>> directives.  Something like this:
>>
>>  mov r0, r1 @ [length 2]
>>  add ip, lr, ip @ [length 4]
>>  mov r0, r1 @ [length 4] <-- assembler error 'insn has length 2'
>>
>> GCC can output length information, but it is never exact, and it is
>> not in a form recognized by the assembler.
>>
>> On x86, I have no idea how this would work.
>>
>> --
>> Daniel Jacobowitz
>> CodeSourcery
>>
>
> I see.
>
> Currently GCC doesn't compute the current encoding offset (doesn't
> know mnemonic/opcode lengths), nor does it perform a relaxation pass
> (to resolve forward displacement/branch offsets).   Without these it
> cannot accurately formulate such assertions.
>
> Our proposal is to let the assembler itself (knowing best the details
> of the encoding stream, offsets, and the processor) align
> instructions, with hints about the structure (block starts, ends,
> instruction sets) using macros/assertions/tokens if needed.
>
> Another option would be to expose some subset of the assembler
> functionality as a plugin to GCC (similar to how gold is used) to
> extract the instruction sizes.   Any comments on that approach?
>

I would suggest generating object code directly, totally bypassing
assembler. Many compilers do it. But it is a HUGE effort.


--
H.J.

Re: Scheduling x86 dispatch windows

Jakub Jelinek
In reply to this post by Quentin Neill
On Fri, Jun 11, 2010 at 02:09:33PM -0500, Quentin Neill wrote:
> Currently GCC doesn't compute the current encoding offset (doesn't
> know mnemonic/opcode lengths),

That's not true, gcc for i?86/x86_64 actually calculates the length, and
for most of the commonly used insns it does so correctly.  I spent some
time fixing various bugs in it a year ago, see
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg01808.html
and the thread around it.

Many of the remaining few issues (haven't tested bdver ISA additions for
lengths) are fixable too, of course there is always inline asm where
GCC can't know.

        Jakub

Re: Scheduling x86 dispatch windows

Quentin Neill
In reply to this post by H.J. Lu-30
On Thu, Jun 10, 2010 at 5:23 PM, H.J. Lu <[hidden email]> wrote:
> [snip]
> x86 assembler isn't an optimizing assembler. -mtune only does
> instruction selection.  What you are proposing sounds like an optimizing
> assembler to me. Are we going to support scheduling, macro, ...?
> --
> H.J.

Just to clarify, we are not doing scheduling or macros.   The
assembler already supports alignment and padding using .align and
friends, which can be from the compiler and from hand-written
assembly.

Now we are seeing more complex alignment rules that are not as simple
as they used to be for the older hardware.  It will be almost impossible
for an assembly programmer to insert the right directives, not to
mention any change might invalidate previous alignments.   Assembly
programmers will be out of luck (that is, unless the compiler becomes
the assembler).

The essence is we want to insert prefixes (as well as nops) according
to certain rules known at encoding time.  The mechanism implementing
these rules can be abstracted (table driven?) and could be applicable
to any hardware having similar features.
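
A table-driven version of the rules could look roughly like the sketch
below. The limits are placeholders, since N, x, y and k are not given in
this thread, and the names LIMITS, fits, add and pick are invented for
illustration. Each instruction is described by the resources it uses,
and the same check serves both for deciding when to close a window and
for picking the first ready instruction that does not violate the rules.

```python
# Hypothetical per-window limits; the thread leaves N, x, y and k unspecified.
LIMITS = {"insns": 4, "loads": 2, "stores": 1, "imms": 2}


def fits(window, insn):
    """True if adding insn keeps every counter in window within LIMITS."""
    return all(window.get(k, 0) + insn.get(k, 0) <= LIMITS[k] for k in LIMITS)


def add(window, insn):
    """Return a new window state with insn's resource usage added."""
    return {k: window.get(k, 0) + insn.get(k, 0) for k in LIMITS}


def pick(window, ready):
    """Index of the first ready instruction that fits, or None if none does."""
    for i, insn in enumerate(ready):
        if fits(window, insn):
            return i
    return None
```

Supporting another processor with similar features would then mean
supplying a different LIMITS table rather than new code. Alignment and
prefix rules are not covered here and would need more than simple
counters.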

As gcc does not currently encode and/or generate object code, we are
wary of introducing such assembler functionality and want to avoid it if
possible, instead leveraging the existing binutils infrastructure.

--
Quentin Neill (with some input from Reza Yazdani)

Re: Scheduling x86 dispatch windows

Andi Kleen-3
In reply to this post by Quentin Neill
Quentin Neill <[hidden email]> writes:
>
> Another option would be to expose some subset of the assembler
> functionality as a plugin to GCC (similar to how gold is used) to
> extract the instruction sizes.   Any comments on that approach?

AFAIK gcc already does keep track of instruction lengths
(e.g. for LOOP), but it may not be fully reliable.

But if you need more why can't you just link the whole assembler
into gcc? That would hopefully speed up compilation too
(e.g. over time the text generation of instructions could
be bypassed)

I don't know how hard it would be, but it would seem like the
right thing to do.

-Andi

Re: Scheduling x86 dispatch windows

H.J. Lu-30
In reply to this post by Quentin Neill
On Fri, Jun 11, 2010 at 3:42 PM, Quentin Neill
<[hidden email]> wrote:

> On Thu, Jun 10, 2010 at 5:23 PM, H.J. Lu <[hidden email]> wrote:
>> [snip]
>> x86 assembler isn't an optimizing assembler. -mtune only does
>> instruction selection.  What you are proposing sounds like an optimizing
>> assembler to me. Are we going to support scheduling, macro, ...?
>> --
>> H.J.
>
> Just to clarify, we are not doing scheduling or macros.   The
> assembler already supports alignment and padding using .align and
> friends, which can be from the compiler and from hand-written
> assembly.
>
> Now we are seeing more complex alignment rules that are not as simple
> as they used to be for the older hardware.  It will be almost impossible
> for an assembly programmer to insert the right directives, not to
> mention any change might invalidate previous alignments.   Assembly
> programmers will be out of luck (that is, unless the compiler becomes
> the assembler).

If you can find a way to help assembly programmers via new directives,
it is great.  GNU x86 assembler should just translate assembly code
into binary code. The output of "objdump -d" should be identical
to the input assembly.

We shouldn't turn GNU x86 assembler into an optimizing assembler.
Next people may ask assembler to remove redundant instructions, ...

Right now, when something goes wrong, people don't have to debug
assembler since it is very unlikely that the problem is in assembler.
When assembler starts to make changes to assembly input, we have
another place where a bug may be introduced.

>
> The essence is we want to insert prefixes (as well as nops) according
> to certain rules known at encoding time.  The mechanism implementing
> these rules can be abstracted (table driven?) and could be applicable
> to any hardware having similar features.

Can you implement them with new directives/pseudo instructions?

BTW, GCC should know the instruction length. If not, it is a GCC bug.


--
H.J.

Re: Scheduling x86 dispatch windows

Ian Lance Taylor-3
In reply to this post by Andi Kleen-3
Andi Kleen <[hidden email]> writes:

> But if you need more why can't you just link the whole assembler
> into gcc? That would hopefully speed up compilation too
> (e.g. over time the text generation of instructions could
> be bypassed)

It would help compilation time a little bit, but generating the
assembly code and running the entire assembler is a fairly small
percentage of the overall compilation time--e.g., 3%.  It's worth
doing a fair amount of work to speed up compilation by 3%, but linking
the assembler into gcc would be an enormous amount of work.  I would
certainly support somebody who wants to tackle it, but I don't think
it's very high up the priority list for the overall project.

Ian

Re: Scheduling x86 dispatch windows

Andi Kleen-3
> It would help compilation time a little bit, but generating the
> assembly code and running the entire assembler is a fairly small
> percentage of the overall compilation time--e.g., 3%.  It's worth
> doing a fair amount of work to speed up compilation by 3%, but linking
> the assembler into gcc would be an enormous amount of work.  I would

Curious -- why do you think it would be that much work?

I admit I haven't looked into gas code, but naively it can't
be all that difficult to e.g. run gas as a thread and
pass the text input through some shared memory buffer?

That would likely not speed up things too much, but
it could be a starting point for more shortcuts.

-Andi

Re: Scheduling x86 dispatch windows

Joern Rennecke-4
Quoting Andi Kleen <[hidden email]>:
> I admit I haven't looked into gas code, but naively it can't
> be all that difficult to e.g. run gas as a thread and
> pass the text input through some shared memory buffer?

If you are generating text anyway, there should be little difference to
the existing -pipe option - at least on operating systems that can handle
processes efficiently.