[PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

classic Classic list List threaded Threaded
37 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ryan S. Arnold
On Tue, Sep 3, 2013 at 11:18 AM, Carlos O'Donell <[hidden email]> wrote:
> We have one, it's the glibc microbenchmark, and we want to expand it,
> otherwise when ACME comes with their patch for ARM and breaks performance
> for targets that Linaro cares about I have no way to reject the patch
> objectively :-)

Can you be objective in analyzing performance when two different
people have differing opinions on what performance preconditions
should be coded against?

There are some cases that are obvious.. we know that from pipeline
analysis that certain instruction sequences can hinder performance.
That is objective and can be measured by a benchmark, but saying that
a particular change penalizes X sized copies but helps Y sized copies
when there are no published performance preconditions isn't.  It's a
difference in opinion of what's important.

PowerPC has had the luxury of not having their performance
pre-conditions contested.  PowerPC string performance is optimized
based upon customer data-set analysis.  So PowerPC's preconditions are
pretty concrete...  Optimize for aligned data in excess of 128-bytes
(I believe).

> You need to statistically analyze the numbers, assign weights to ranges,
> and come up with some kind of number that evaluates the results based
> on *some* formula. That is the only way we are going to keep moving
> performance forward (against some kind of criteria).

This sounds like establishing preconditions (what types of data will
be optimized for).

Unless technology evolves that you can statistically analyze data in
real time and adjust the implementation based on what you find (an
implementation with a different set of preconditions) to account for
this you're going to end up with a lot of in-fighting over
performance.

I've run into situations where I recommended that a customer code
their own string function implementation because they continually
encountered unaligned-data when copying-by-value in C++ functions and
PowerPC's string function implementations penalized unaligned copies
in preference for aligned copies.

Ryan
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ryan S. Arnold
In reply to this post by Ondřej Bílka
On Tue, Sep 3, 2013 at 12:37 PM, Ondřej Bílka <[hidden email]> wrote:
>> I disagree strongly. You *must* come up with a measurable answer and
>> looking at a graph is never a solution I'm going to accept.
>>
> You can have that opinion.
> Looking at performance graphs is most powerful technique how to
> understand performance. I got most of my improvements from analyzing
> these.

Are there any open source pipeline analysis tools?  I've found the one
I've used (proprietary) to be a pretty good indicator of general
instruction selection optimization/correctness.

Graphs are useful if everyone's agreed which data-sizes and alignment
restrictions should be optimized for a particular platform.

Ryan
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Carlos O'Donell-6
In reply to this post by Ryan S. Arnold
On 09/03/2013 03:31 PM, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 11:18 AM, Carlos O'Donell <[hidden email]> wrote:
>> We have one, it's the glibc microbenchmark, and we want to expand it,
>> otherwise when ACME comes with their patch for ARM and breaks performance
>> for targets that Linaro cares about I have no way to reject the patch
>> objectively :-)
>
> Can you be objective in analyzing performance when two different
> people have differing opinions on what performance preconditions
> should be coded against?

No.

> There are some cases that are obvious.. we know that from pipeline
> analysis that certain instruction sequences can hinder performance.
> That is objective and can be measured by a benchmark, but saying that
> a particular change penalizes X sized copies but helps Y sized copies
> when there are no published performance preconditions isn't.  It's a
> difference in opinion of what's important.

I agree. The project needs to adopt some set of performance preconditions
and document them and defend them.

If we can't defend these positions then we will never be able to evaluate
any performance patches. We will see-saw between several implementations
over the years.

The current set of performance preconditions are baked into the experience
of the core developers reviewing patches. I want the experts out of the
loop.

> PowerPC has had the luxury of not having their performance
> pre-conditions contested.  PowerPC string performance is optimized
> based upon customer data-set analysis.  So PowerPC's preconditions are
> pretty concrete...  Optimize for aligned data in excess of 128-bytes
> (I believe).

We should be documenting this somewhere, preferably in a Power-specific
test that looks at just this kind of issue.

Documenting this statically is the first, in my opinion, stepping stone
to having something like dynamic feedback.

>> You need to statistically analyze the numbers, assign weights to ranges,
>> and come up with some kind of number that evaluates the results based
>> on *some* formula. That is the only way we are going to keep moving
>> performance forward (against some kind of criteria).
>
> This sounds like establishing preconditions (what types of data will
> be optimized for).

I agree. We need it.

> Unless technology evolves that you can statistically analyze data in
> real time and adjust the implementation based on what you find (an
> implementation with a different set of preconditions) to account for
> this you're going to end up with a lot of in-fighting over
> performance.

Why do you assume we'll have a lot of in-fighting over performance?

At present we've split the performance intensive (or so we believe)
routines on a per-machine basis. The arguments are then going to be
had only on a per-machine basis, and even then for each hardware
variant can have an IFUNC resolver select the right routine at
runtime.

Then we come upon the tunables that should allow some dynamic adjustment
of an algorithm based on realtime data.

> I've run into situations where I recommended that a customer code
> their own string function implementation because they continually
> encountered unaligned-data when copying-by-value in C++ functions and
> PowerPC's string function implementations penalized unaligned copies
> in preference for aligned copies.

Provide both in glibc and expose a tunable?

Cheers,
Carlos.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ryan S. Arnold
On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <[hidden email]> wrote:
> The current set of performance preconditions are baked into the experience
> of the core developers reviewing patches. I want the experts out of the
> loop.
>

This is the clutch.

Developers working for the CPU manufacturers are privy to a lot of
unpublished timing, penalty/hazard information, as well as proprietary
pipeline analysis tools.

Will "J. Random Hacker" working for MegaCorp tell you that the reason
he's chosen a particular instruction sequence is because the system
he's working on has a tiny branch cache (the size of which might be
unpublished)?

>> PowerPC has had the luxury of not having their performance
>> pre-conditions contested.  PowerPC string performance is optimized
>> based upon customer data-set analysis.  So PowerPC's preconditions are
>> pretty concrete...  Optimize for aligned data in excess of 128-bytes
>> (I believe).
>
> We should be documenting this somewhere, preferably in a Power-specific
> test that looks at just this kind of issue.

I might be mistaken, but I think you'll find these preconditions
explicitly defined in the string function implementation source files
for PowerPC.

> Documenting this statically is the first, in my opinion, stepping stone
> to having something like dynamic feedback.

Absolutely!

>> Unless technology evolves that you can statistically analyze data in
>> real time and adjust the implementation based on what you find (an
>> implementation with a different set of preconditions) to account for
>> this you're going to end up with a lot of in-fighting over
>> performance.
>
> Why do you assume we'll have a lot of in-fighting over performance?

I'm projecting here.  If someone proposed to adjust the PowerPC
optimized string functions to their own preconditions and it
drastically changed the performance of existing customers, or future
customers you'd see me panic.

> At present we've split the performance intensive (or so we believe)
> routines on a per-machine basis. The arguments are then going to be
> had only on a per-machine basis, and even then for each hardware
> variant can have an IFUNC resolver select the right routine at
> runtime.

Right, selecting the right variant with IFUNC has certainly helped
platforms that didn't use optimized libraries.  This is the low
hanging fruit.  So now our concern is the proliferation of micro-tuned
variants and a lack of qualified eyes to objectively review the
patches.

> Then we come upon the tunables that should allow some dynamic adjustment
> of an algorithm based on realtime data.

Yes, you can do this with tunables if the developer knows something
about the data (more about that later).

>> I've run into situations where I recommended that a customer code
>> their own string function implementation because they continually
>> encountered unaligned-data when copying-by-value in C++ functions and
>> PowerPC's string function implementations penalized unaligned copies
>> in preference for aligned copies.
>
> Provide both in glibc and expose a tunable?

So do we (the glibc community) no longer consider the proliferation of
tunables to be a mortal sin?  Or was that only with regard to
configuration options?  Regardless, it still burdens the Linux
distributions and developers who have to provide QA.

If tunables are available, then trial-and-error would help where a
user doesn't know the particulars of his application's data usage.

Using tunables is potentially problematic as well.  Often testing a
condition in highly optimized code is enough to obviate the
performance benefit you're attempting to provide. Checking for feature
availability might consume enough cycles to make it senseless to use
the facility itself.  I believe this is what happened in the early
days trying to use VMX in string routines.

Additionally, while dynamically linked applications won't suffer from
using IFUNC resolved functions (because of mandatory PLT usage), glibc
internal usage of IFUNC resolved functions very likely will if/when
forced to go through the PLT, especially on systems like PowerPC where
indirect branching is more expensive than direct branching.  When
Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
keeping a 'generic' optimized version for internal libc usage.  We'll
see how it all works together.

So using tunables alone isn't necessarily a win unless it's coupled
with IFUNC.  But using IFUNC also isn't a guaranteed win in all cases.

For external usage, Using IFUNC in combination with a tunable should
be beneficial.  For instance, on systems that don't have a concrete
cacheline size (e.g., the A2 processor), at process initialization we
query the system cacheline size, populate a static with the size, and
then the string routines will query that size at runtime.  It'd be
nice to do that query at initialization and then pre-select an
implementation based on cacheline size so we don't have to test for
the cacheline size each time through the string function.

This of course increases the cost of maintaining the string routines
by having myriad of combinations.

These are all the trade-offs we weigh.

Ryan
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ondřej Bílka
In reply to this post by Ryan S. Arnold
On Tue, Sep 03, 2013 at 02:31:14PM -0500, Ryan S. Arnold wrote:

> On Tue, Sep 3, 2013 at 11:18 AM, Carlos O'Donell <[hidden email]> wrote:
> > We have one, it's the glibc microbenchmark, and we want to expand it,
> > otherwise when ACME comes with their patch for ARM and breaks performance
> > for targets that Linaro cares about I have no way to reject the patch
> > objectively :-)
>
> Can you be objective in analyzing performance when two different
> people have differing opinions on what performance preconditions
> should be coded against?
>
I fear more situations like when google and facebook will send their
implementation which saves each milion dollars in energy bill over others
implementation.

> > You need to statistically analyze the numbers, assign weights to ranges,
> > and come up with some kind of number that evaluates the results based
> > on *some* formula. That is the only way we are going to keep moving
> > performance forward (against some kind of criteria).
>
> This sounds like establishing preconditions (what types of data will
> be optimized for).
>
> Unless technology evolves that you can statistically analyze data in
> real time and adjust the implementation based on what you find (an
> implementation with a different set of preconditions) to account for
> this you're going to end up with a lot of in-fighting over
> performance.
>
Technology is there at least for x64, its political problem.

You do not need real time analysis most of time. When user accepts
possibility performance could be worse by constant factor then profiling
becomes easier. We could rely on data from previous runs of program
instead.

First step would be to replace ifunc selection by first running
benchmarks on processor and then creating hash table with numbers of
fastest implementation and selection will be done by this lookup.

My profiler could(modulo externalities) for each program measure
 which implementation is fastest on given processor.

Then AFAIR appending data to elf is legal so we could append hash table
with program specific numbers to binary.

A ifunc resolver will first look to end of binary if there is ifunc table.
This could be implemented that worst that can happen for false positive
is selecting slow implementation. If there is entry for current processor it
will use it, otherwise it will look to table at end of libc.so and
repeat this step and if it was not found it will use default implementation.

This is doable, problems is motivate somebody to do profiling. My
approach tries to it possible for programmers (write testing, run it at
different machines and assemble), distributions (find enough users that
will agree to have sometimes turned profiling on and sending results.)
or end users who could turn profiling on to learn their patterns.
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ondřej Bílka
In reply to this post by Ryan S. Arnold
On Tue, Sep 03, 2013 at 03:56:29PM -0500, Ryan S. Arnold wrote:

> On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <[hidden email]> wrote:
> > The current set of performance preconditions are baked into the experience
> > of the core developers reviewing patches. I want the experts out of the
> > loop.
> >
>
> This is the clutch.
>
> Developers working for the CPU manufacturers are privy to a lot of
> unpublished timing, penalty/hazard information, as well as proprietary
> pipeline analysis tools.
>
One problem of those on x64 is how much are these informations complete.
At least based on reading AMD optimization manuals around half of advices
is inaccurate so I need to check if they hold anyway.

> > At present we've split the performance intensive (or so we believe)
> > routines on a per-machine basis. The arguments are then going to be
> > had only on a per-machine basis, and even then for each hardware
> > variant can have an IFUNC resolver select the right routine at
> > runtime.
>
> Right, selecting the right variant with IFUNC has certainly helped
> platforms that didn't use optimized libraries.  This is the low
> hanging fruit.  So now our concern is the proliferation of micro-tuned
> variants and a lack of qualified eyes to objectively review the
> patches.
>
Not there yet, at least for AMD, where its often using ifunc to select
slowest implementation. When you look at x64 benchmarks relation between
asymptoticaly best and selected implementation is quite random.

> > Then we come upon the tunables that should allow some dynamic adjustment
> > of an algorithm based on realtime data.
>
> Yes, you can do this with tunables if the developer knows something
> about the data (more about that later).
>
> >> I've run into situations where I recommended that a customer code
> >> their own string function implementation because they continually
> >> encountered unaligned-data when copying-by-value in C++ functions and
> >> PowerPC's string function implementations penalized unaligned copies
> >> in preference for aligned copies.
> >
> > Provide both in glibc and expose a tunable?
>
> So do we (the glibc community) no longer consider the proliferation of
> tunables to be a mortal sin?  Or was that only with regard to
> configuration options?  Regardless, it still burdens the Linux
> distributions and developers who have to provide QA.
>
> If tunables are available, then trial-and-error would help where a
> user doesn't know the particulars of his application's data usage.
>
Here auto-tuning as described in other mail would give better result.

Unless there data usage would not fit any of our variants in which case
a custom routine would be needed.
 

> Using tunables is potentially problematic as well.  Often testing a
> condition in highly optimized code is enough to obviate the
> performance benefit you're attempting to provide. Checking for feature
> availability might consume enough cycles to make it senseless to use
> the facility itself.  I believe this is what happened in the early
> days trying to use VMX in string routines.
>
> Additionally, while dynamically linked applications won't suffer from
> using IFUNC resolved functions (because of mandatory PLT usage), glibc
> internal usage of IFUNC resolved functions very likely will if/when
> forced to go through the PLT, especially on systems like PowerPC where
> indirect branching is more expensive than direct branching.  When
> Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
> keeping a 'generic' optimized version for internal libc usage.  We'll
> see how it all works together.
>
Those that are concerned about getting each bit of performance would
recompile everything with --march=target anyway. It would be nice to
select internal version based on --march.
 

> So using tunables alone isn't necessarily a win unless it's coupled
> with IFUNC.  But using IFUNC also isn't a guaranteed win in all cases.
>
> For external usage, Using IFUNC in combination with a tunable should
> be beneficial.  For instance, on systems that don't have a concrete
> cacheline size (e.g., the A2 processor), at process initialization we
> query the system cacheline size, populate a static with the size, and
> then the string routines will query that size at runtime.  It'd be
> nice to do that query at initialization and then pre-select an
> implementation based on cacheline size so we don't have to test for
> the cacheline size each time through the string function.
>
> This of course increases the cost of maintaining the string routines
> by having myriad of combinations.
>
> These are all the trade-offs we weigh.
>
> Ryan

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Carlos O'Donell-6
In reply to this post by Ryan S. Arnold
On 09/03/2013 04:56 PM, Ryan S. Arnold wrote:

> On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <[hidden email]> wrote:
>> The current set of performance preconditions are baked into the experience
>> of the core developers reviewing patches. I want the experts out of the
>> loop.
>>
>
> This is the clutch.
>
> Developers working for the CPU manufacturers are privy to a lot of
> unpublished timing, penalty/hazard information, as well as proprietary
> pipeline analysis tools.
>
> Will "J. Random Hacker" working for MegaCorp tell you that the reason
> he's chosen a particular instruction sequence is because the system
> he's working on has a tiny branch cache (the size of which might be
> unpublished)?

That's an interesting point. I've seen similar things happen in *.md
file generation in gcc and had forgotten about it entirely. However,
in all such instances the developer can say "I am bound by confidentiality
not to reveal the reasons why I made my choices." We can document that.

The opposite point is also true in that we might actually tune an
implementation based on real-world data and have no idea why it behaves
optimally from an architectural perspective. In which case we have to
document "This implementation was tuned using data set X."

How are these two situations any different really?

I would accept patches in both cases, but I would like to see that
MegaCorp's patches don't change anything for the consumer level CPUs
we routinely employ.

>>> PowerPC has had the luxury of not having their performance
>>> pre-conditions contested.  PowerPC string performance is optimized
>>> based upon customer data-set analysis.  So PowerPC's preconditions are
>>> pretty concrete...  Optimize for aligned data in excess of 128-bytes
>>> (I believe).
>>
>> We should be documenting this somewhere, preferably in a Power-specific
>> test that looks at just this kind of issue.
>
> I might be mistaken, but I think you'll find these preconditions
> explicitly defined in the string function implementation source files
> for PowerPC.

Excellent. We should probably have some more central location for this
information in a developer's guide or internals guide.

>> Documenting this statically is the first, in my opinion, stepping stone
>> to having something like dynamic feedback.
>
> Absolutely!

Glad we agree.

>>> Unless technology evolves that you can statistically analyze data in
>>> real time and adjust the implementation based on what you find (an
>>> implementation with a different set of preconditions) to account for
>>> this you're going to end up with a lot of in-fighting over
>>> performance.
>>
>> Why do you assume we'll have a lot of in-fighting over performance?
>
> I'm projecting here.  If someone proposed to adjust the PowerPC
> optimized string functions to their own preconditions and it
> drastically changed the performance of existing customers, or future
> customers you'd see me panic.

At least with a microbenchmark you'd know the situation was about to
go sideways immediately after running the benchmark. Otherwise it might
be quite a while before you do profiling before a big tools release.

This is exactly the situation we are in right now for x86 and x86-64,
and will likely see with ARM/AArch64 and the plethora of vendor
implementations.

>> At present we've split the performance intensive (or so we believe)
>> routines on a per-machine basis. The arguments are then going to be
>> had only on a per-machine basis, and even then for each hardware
>> variant can have an IFUNC resolver select the right routine at
>> runtime.
>
> Right, selecting the right variant with IFUNC has certainly helped
> platforms that didn't use optimized libraries.  This is the low
> hanging fruit.  So now our concern is the proliferation of micro-tuned
> variants and a lack of qualified eyes to objectively review the
> patches.

Yes, that's a concern.

>> Then we come upon the tunables that should allow some dynamic adjustment
>> of an algorithm based on realtime data.
>
> Yes, you can do this with tunables if the developer knows something
> about the data (more about that later).

We can't always assume ignorance, but I agree that there are problems
here.

>>> I've run into situations where I recommended that a customer code
>>> their own string function implementation because they continually
>>> encountered unaligned-data when copying-by-value in C++ functions and
>>> PowerPC's string function implementations penalized unaligned copies
>>> in preference for aligned copies.
>>
>> Provide both in glibc and expose a tunable?
>
> So do we (the glibc community) no longer consider the proliferation of
> tunables to be a mortal sin?  Or was that only with regard to
> configuration options?  Regardless, it still burdens the Linux
> distributions and developers who have to provide QA.

We need to have a broader conversation about this very issue.

Pragmatically configuration and tunables are similar problems.

I think tunables are a mortal sin and would rather see algorithms
that can select the best implementation for the user automatically.
However, we need tunables to bootstrap that. I would hope that as
tunables appear they disappear into smart algorithms that do a better
job of tunning internals.

> If tunables are available, then trial-and-error would help where a
> user doesn't know the particulars of his application's data usage.

It would.

> Using tunables is potentially problematic as well.  Often testing a
> condition in highly optimized code is enough to obviate the
> performance benefit you're attempting to provide. Checking for feature
> availability might consume enough cycles to make it senseless to use
> the facility itself.  I believe this is what happened in the early
> days trying to use VMX in string routines.

Agreed. You have to make it configurable then, and that's costly from
a complexity perspective, and makes testing all build configurations
harder and harder.

> Additionally, while dynamically linked applications won't suffer from
> using IFUNC resolved functions (because of mandatory PLT usage), glibc
> internal usage of IFUNC resolved functions very likely will if/when
> forced to go through the PLT, especially on systems like PowerPC where
> indirect branching is more expensive than direct branching.  When
> Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
> keeping a 'generic' optimized version for internal libc usage.  We'll
> see how it all works together.

Good point.

> So using tunables alone isn't necessarily a win unless it's coupled
> with IFUNC.  But using IFUNC also isn't a guaranteed win in all cases.

Unless we try to come up wtih something to make it faster on Power?

> For external usage, Using IFUNC in combination with a tunable should
> be beneficial.  For instance, on systems that don't have a concrete
> cacheline size (e.g., the A2 processor), at process initialization we
> query the system cacheline size, populate a static with the size, and
> then the string routines will query that size at runtime.  It'd be
> nice to do that query at initialization and then pre-select an
> implementation based on cacheline size so we don't have to test for
> the cacheline size each time through the string function.

Yes.

> This of course increases the cost of maintaining the string routines
> by having myriad of combinations.

Which we already have and need to test via the IFUNC testing functionality
that HJ added.

> These are all the trade-offs we weigh.

... and more.

I see your concerns, and raise you a couple more.

If we leave the situation as is we will have a continued and difficult
time accepting performance patches for functional units of the library.

We need to drive objectivity into the evaluation of the patches. It won't
be completely objective at first, but it better move in that direction.

Cheers,
Carlos.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Siddhesh Poyarekar-3
In reply to this post by Carlos O'Donell-6
On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote:

> I agree. The eventual goal of the project is to have some kind of
> whole system benchmarking that allows users to feed in their profiles
> and allow us as developers to see what users are doing with our library.
>
> Just like CPU designers feed in a whole distribution of applications
> and look at the probability of instruction selection and tweak instruction
> to microcode mappings.
>
> I am willing to accept a certain error in the process as long as I know
> we are headed in the right direction. If we all disagree about the
> direction we are going in then we should talk about it.
>
> I see:
>
> microbenchmarks -> whole system benchmarks -> profile driven optimizations

I've mentioned this before - microbenchmarks are not a way to whole
system benchmarks in that they don't replace system benchmarks.  We
need to work on both in parallel because both have different goals.

A microbenchmark would have parameters such as alignment, size and
cache pressure to determine how an implementation scales.  These are
generic numbers (i.e. they're not tied to specific high level
workloads) that a developer can use to design their programs.

Whole system benchmarks however work at a different level.  They would
give an average case number that describes how a specific recipe
impacts performance of a set of programs.  An administrator would use
these to tweak the system for the workload.

> I would be happy to accept a patch that does:
> * Shows the benchmark numbers.
> * Explains relevant factors not caught by the benchmark that affect
>   performance, what they are, and why the patch should go in.
>
> My goal is to increase the quality of the written rationales for
> performance related submissions.

Agreed.  In fact, this should go in as a large comment in the
implementation itself.  Someone had mentioned in the past (was it
Torvald?) that every assembly implementation we write should be as
verbose in comments as it can possibly be so that there is no
ambiguity about the rationale for selection of specific instruction
sequences over others.

> >> If we have N tests and they produce N numbers, for a given target,
> >> for a given device, for a given workload, there is a set of importance
> >> weights on N that should give you some kind of relevance.
> >>
> > You are jumping to case when we will have these weights. Problematic
> > part is getting those.
>
> I agree.
>
> It's hard to know the weights without having an intuitive understanding
> of the applications you're running on your system and what's relevant
> for their performance.

1. Assume aligned input.  Nothing should take (any noticeable)
   performance away from align copies/moves
2. Scale with size
3. Provide acceptable performance for unaligned sizes without
   penalizing the aligned case
4. Measure the effect of dcache pressure on function performance
5. Measure effect of icache pressure on function performance.

Depending on the actual cost of cache misses on different processors,
the icache/dcache miss cost would either have higher or lower weight
but for 1-3, I'd go in that order of priorities with little concern
for unaligned cases.

Siddhesh
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ondřej Bílka
On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote:

> On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote:
> > I agree. The eventual goal of the project is to have some kind of
> > whole system benchmarking that allows users to feed in their profiles
> > and allow us as developers to see what users are doing with our library.
> >
> > Just like CPU designers feed in a whole distribution of applications
> > and look at the probability of instruction selection and tweak instruction
> > to microcode mappings.
> >
> > I am willing to accept a certain error in the process as long as I know
> > we are headed in the right direction. If we all disagree about the
> > direction we are going in then we should talk about it.
> >
> > I see:
> >
> > microbenchmarks -> whole system benchmarks -> profile driven optimizations
>
> I've mentioned this before - microbenchmarks are not a way to whole
> system benchmarks in that they don't replace system benchmarks.  We
> need to work on both in parallel because both have different goals.
>
> A microbenchmark would have parameters such as alignment, size and
> cache pressure to determine how an implementation scales.  These are
> generic numbers (i.e. they're not tied to specific high level
> workloads) that a developer can use to design their programs.
>
> Whole system benchmarks however work at a different level.  They would
> give an average case number that describes how a specific recipe
> impacts performance of a set of programs.  An administrator would use
> these to tweak the system for the workload.
>
> > I would be happy to accept a patch that does:
> > * Shows the benchmark numbers.
> > * Explains relevant factors not caught by the benchmark that affect
> >   performance, what they are, and why the patch should go in.
> >
> > My goal is to increase the quality of the written rationales for
> > performance related submissions.
>
> Agreed.  In fact, this should go in as a large comment in the
> implementation itself.  Someone had mentioned in the past (was it
> Torvald?) that every assembly implementation we write should be as
> verbose in comments as it can possibly be so that there is no
> ambiguity about the rationale for selection of specific instruction
> sequences over others.
>
> > >> If we have N tests and they produce N numbers, for a given target,
> > >> for a given device, for a given workload, there is a set of importance
> > >> weights on N that should give you some kind of relevance.
> > >>
> > > You are jumping to case when we will have these weights. Problematic
> > > part is getting those.
> >
> > I agree.
> >
> > It's hard to know the weights without having an intuitive understanding
> > of the applications you're running on your system and what's relevant
> > for their performance.
>
> 1. Assume aligned input.  Nothing should take (any noticeable)
>    performance away from align copies/moves
Not very useful as this is extremely dependant on function measured. For
functions like strcmp and strlen alignments are mostly random so aligned
case does not say much. On opposite end of spectrum is memset which is
almost always 8 byte aligned and unaligned performance does not make lot
of sense.

> 2. Scale with size
Not very important for several reasons. One is that big sizes are cold
(just look in oprofile output that loops are less frequent than header.)

Second reason is that if we look at caller large sizes are unlikely
bottleneck.

One type of usage is find delimiter like:

i=strlen(n);
for (i=0;i<n;i++)
  something();

Here for n large a strlen contribution is likely small.

Or second one is skipping parts of string. Consider following

while(p=strchr(p+1,'a')){
  something();
}

If p was 1000 byte buffer then best case is when a is not there and we
do one 1000 byte strchr. Worst case is when string consist entirely of
a's and we need to call 1 byte strchr 1000 times.

> 3. Provide acceptable performance for unaligned sizes without
>    penalizing the aligned case

This is quite important case. It should be measured correctly, what is
important is that alignment varies. This can be slower than when you
pick fixed alignment and alignment varies in reality.

> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
>
Here you really need to base weigths on function usage patterns.
A bigger code size is acceptable for functions that are called more
often. You need to see distribution of how are calls clustered to get
full picture. A strcmp is least sensitive to icache concerns, as when it
is called its mostly 100 times over in tight loop so size is not big issue.
If same number of call is uniformnly spread through program we need
stricter criteria.

> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.
>
> Siddhesh
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Siddhesh Poyarekar-3
On Wed, Sep 04, 2013 at 01:03:33PM +0200, Ondřej Bílka wrote:
> > 1. Assume aligned input.  Nothing should take (any noticeable)
> >    performance away from align copies/moves
> Not very useful as this is extremely dependant on function measured. For
> functions like strcmp and strlen alignments are mostly random so aligned
> case does not say much. On opposite end of spectrum is memset which is
> almost always 8 byte aligned and unaligned performance does not make lot
> of sense.

Agreed.  So for functions like memset/memcpy/memmove we heavily favour
aligned inputs.  For strlen/strchr/memchr we strive for acceptable
average case performance, i.e. less variance in performance.

> > 2. Scale with size
> Not very important for several reasons. One is that big sizes are cold
> (just look in oprofile output that loops are less frequent than header.)
>
> Second reason is that if we look at caller large sizes are unlikely
> bottleneck.

I did not imply that we optimize for larger sizes - I meant that as a
general principle, the algorithm should scale reasonably for larger
sizes.  A quadratic algorithm is bad even if it gives acceptable
performance for smaller sizes.  I would consider that a pretty
important trait to monitor in the benchmark even if we won't really
get such implementations in practice.

> > 3. Provide acceptable performance for unaligned sizes without
> >    penalizing the aligned case
>
> This is quite important case. It should be measured correctly, what is
> important is that alignment varies. This can be slower than when you
> pick fixed alignment and alignment varies in reality.

I agree that we need to measure unaligned cases correctly.

> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
> >
> Here you really need to base weigths on function usage patterns.
> A bigger code size is acceptable for functions that are called more
> often. You need to see distribution of how are calls clustered to get
> full picture. A strcmp is least sensitive to icache concerns, as when it
> is called its mostly 100 times over in tight loop so size is not big issue.
> If same number of call is uniformnly spread through program we need
> stricter criteria.

That's not necessarily true.  It may be true for specific applications
but I don't think an strcmp is always called in a tight loop.  Do you
have a qualitative argument to prove that statement or is it just
based on dry runs?

Siddhesh
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Carlos O'Donell-6
In reply to this post by Siddhesh Poyarekar-3
On 09/04/2013 03:30 AM, Siddhesh Poyarekar wrote:

> On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote:
>> I agree. The eventual goal of the project is to have some kind of
>> whole system benchmarking that allows users to feed in their profiles
>> and allow us as developers to see what users are doing with our library.
>>
>> Just like CPU designers feed in a whole distribution of applications
>> and look at the probability of instruction selection and tweak instruction
>> to microcode mappings.
>>
>> I am willing to accept a certain error in the process as long as I know
>> we are headed in the right direction. If we all disagree about the
>> direction we are going in then we should talk about it.
>>
>> I see:
>>
>> microbenchmarks -> whole system benchmarks -> profile driven optimizations
>
> I've mentioned this before - microbenchmarks are not a way to whole
> system benchmarks in that they don't replace system benchmarks.  We
> need to work on both in parallel because both have different goals.

Sorry, my ASCII graph was misleading, and I agree with you.

What I meant to say is that we need to walk before we run.
Creating a microbenchmark framework will give us the tools
we need to move on to whole system benchmarks and eventually
to profile driven optimizations. We will continue to do all
3 of these activities once we reach the end of that logical
progression.

Cheers,
Carlos.
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ryan S. Arnold
In reply to this post by Siddhesh Poyarekar-3
On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <[hidden email]> wrote:
> 1. Assume aligned input.  Nothing should take (any noticeable)
>    performance away from align copies/moves
> 2. Scale with size

In my experience scaling with data-size isn't really possible beyond a
certain point.  We pick a target range of sizes to optimize for based
upon customer feedback and we try to use pre-fetching in that range as
efficiently as possible.  But I get your point.  We don't want any
particular size to be severely penalized.

Each architecture and specific platform needs to know/decide what the
optimal range is and document it.  Even for Power we have different
expectations on server hardware like POWER7, vs. embedded hardware
like ppc 476.

> 3. Provide acceptable performance for unaligned sizes without
>    penalizing the aligned case

There are cases where the user can't control the alignment of the data
being fed into string functions, and we shouldn't penalize them for
these situations if possible, but in reality if a string routine shows
up hot in a profile this is a likely culprit and there's not much that
can be done once the unaligned case is made as stream-lined as
possible.

Simply testing for alignment (not presuming aligned data) itself slows
down the processing of aligned-data, but that's an unavoidable
reality.  I've chatted with some compiler folks about the possibility
of branching directly to aligned case labels in string routines if the
compiler is able to detect aligned data.. but was informed that this
suggestion might get me burned at the stake.

As previously discussed, we might be able to use tunables in the
future to mitigate this.  But of course, this would be 'use at your
own risk'.

> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
>
> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.

I know that icache and dcache miss penalty/costs are known for most
architectures but not whether they're "published".  I suppose we can,
at least, encourage developers for the CPU manufacturers to indicate
in the documentation of preconditions which is more expensive,
relative to the other if they're unable to indicate the exact costs of
these misses.

Some further thoughts (just to get this stuff documented):

Some performance regressions I'm familiar with (on Power), which CAN
be measured with a baseline micro-benchmark regardless of use-case:

1. Hazard/Penalties - I'm thinking things like load-hit-store in the
tail of a loop, e.g., label: load value from a register, do work,
store to same register, branch to loop.  Take a stall when the value
at the top of the loop isn't ready to load.

2. Dispatch grouping - Some instructions need to be first-in-group,
etc.  Grouping is also based on instruction alignment.  At least on
Power I believe some instructions benefit from specific alignment.

3. Instruction Grouping - Depending on topology of the pipeline,
specific groupings of instructions of might incur pipeline stalls due
to unavailability of the load/store unit (for instance).

4. Facility usage costs - Sometimes using certain facilities for
certain sizes of data are more costly than not using the facility.
For instance, I believe that using the DFPU on Power requires that the
floating-point pipeline be flushed, so BFP and DFP really shouldn't be
used together.  I believe there is a powerpc32 string function which
uses FPRs because they're 64-bits wide even on ppc32.  But we measured
the cost/benefit ratio of using this vs. not.

On Power, micro benchmarks are run in-house with these (and many
other) factors in mind.

Ryan
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ryan S. Arnold
In reply to this post by Ondřej Bílka
On Wed, Sep 4, 2013 at 6:03 AM, Ondřej Bílka <[hidden email]> wrote:
> On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote:
>> 2. Scale with size
> Not very important for several reasons. One is that big sizes are cold
> (just look in oprofile output that loops are less frequent than header.)
>
> Second reason is that if we look at caller large sizes are unlikely
> bottleneck.

From my experience, extremely large data sizes are not very common.
Optimizing for those gets diminishing returns.  I believe that at very
large sizes the pressure is all on the hardware anyway.  Prefetching
large amounts of data in a loop takes a fixed amount of time and given
a large enough amount of data, the overhead introduced by most other
factors is negligible.

>> 4. Measure the effect of dcache pressure on function performance
>> 5. Measure effect of icache pressure on function performance.
>>
> Here you really need to base weigths on function usage patterns.
> A bigger code size is acceptable for functions that are called more
> often. You need to see distribution of how are calls clustered to get
> full picture. A strcmp is least sensitive to icache concerns, as when it
> is called its mostly 100 times over in tight loop so size is not big issue.
> If same number of call is uniformnly spread through program we need
> stricter criteria.

Icache pressure is probably one of the more difficult things to
measure with a benchmark.  I suppose it'd be easier with a pipeline
analyzer.

Can you explain how usage pattern analysis might reveal icache pressure?

I'm not sure how useful 'usage pattern' are when considering dcache
pressure.  On Power we have data-cache prefetch instructions and since
we know that dcache pressure is a reality, we will prefetch if our
data sizes are large enough to out-weigh the overhead of prefetching,
e.g., when the data size exceeds the cacheline size.

Ryan
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ondřej Bílka
On Wed, Sep 04, 2013 at 12:37:33PM -0500, Ryan S. Arnold wrote:

> On Wed, Sep 4, 2013 at 6:03 AM, Ondřej Bílka <[hidden email]> wrote:
> > On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote:
> >> 4. Measure the effect of dcache pressure on function performance
> >> 5. Measure effect of icache pressure on function performance.
> >>
> > Here you really need to base weigths on function usage patterns.
> > A bigger code size is acceptable for functions that are called more
> > often. You need to see distribution of how are calls clustered to get
> > full picture. A strcmp is least sensitive to icache concerns, as when it
> > is called its mostly 100 times over in tight loop so size is not big issue.
> > If same number of call is uniformnly spread through program we need
> > stricter criteria.
>
> Icache pressure is probably one of the more difficult things to
> measure with a benchmark.  I suppose it'd be easier with a pipeline
> analyzer.
>
> Can you explain how usage pattern analysis might reveal icache pressure?
>
With profiler its simple, I profiled firefox a while, results are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile_firefox/result.html

Now when you look to 'Delays between calls' graph you will see peak
which is likely caused by strcmp being called in loop.

From graph about 2/3 of calls happen in less than 128 cycles since last
one. As there is limited number of cache lines that you can access in
128 cycles per call impact is smaller.

> I'm not sure how useful 'usage pattern' are when considering dcache
> pressure.  On Power we have data-cache prefetch instructions and since
> we know that dcache pressure is a reality, we will prefetch if our
> data sizes are large enough to out-weigh the overhead of prefetching,
> e.g., when the data size exceeds the cacheline size.
>
Very useful as overhead of prefetching is determined that this quantity.
You can have two applications that often call memset with size 16000.

First one uses memset to refresh one static array which is entirely in
L1 cache and prefetching is harmful.

Second one does random access of 1GB of memory and prefetching would
help.

Swithching to prefetching when you exceed cache size has advantage of
certainty that is will help.
Real treshold is lower as it is unlikely that large array got as
argument is only thing that occupies cache.
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ondřej Bílka
In reply to this post by Ryan S. Arnold
On Wed, Sep 04, 2013 at 12:35:46PM -0500, Ryan S. Arnold wrote:

> On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <[hidden email]> wrote:
> > 3. Provide acceptable performance for unaligned sizes without
> >    penalizing the aligned case
>
> There are cases where the user can't control the alignment of the data
> being fed into string functions, and we shouldn't penalize them for
> these situations if possible, but in reality if a string routine shows
> up hot in a profile this is a likely culprit and there's not much that
> can be done once the unaligned case is made as stream-lined as
> possible.
>
> Simply testing for alignment (not presuming aligned data) itself slows
> down the processing of aligned-data, but that's an unavoidable
> reality.

How expensive are unaligned loads on powerpc?  On x64 a penalty for
using them is smaller than alternatives(increased branch
misprediction...)

>  I've chatted with some compiler folks about the possibility
> of branching directly to aligned case labels in string routines if the
> compiler is able to detect aligned data.. but was informed that this
> suggestion might get me burned at the stake.
>
You would need to improve gcc detection of alignments first. Now gcc
misses most of opportunities, even in following code gcc issues
retundant alignment checks:

#include <stdint.h>
char *foo(long *x){
 if (((uintptr_t)x)%16)
  return x+4;
 else {
  __builtin_memset(x,0,512);
  return x;
 }
}

If gcc guys fix that then we do not have to ask them anything. We could
just change headers to recognize aligned case like

#define strchr(x,c) ({ char *__x=x;\
  if (__builtin_constant_p(((uintptr_t)__x)%16) && !((uintptr_t)__x)%16)\
    strchr_aligned(__x,c);\
  else\
    strchr(__x,c);\
})

> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
> >
> > Depending on the actual cost of cache misses on different processors,
> > the icache/dcache miss cost would either have higher or lower weight
> > but for 1-3, I'd go in that order of priorities with little concern
> > for unaligned cases.
>
> I know that icache and dcache miss penalty/costs are known for most
> architectures but not whether they're "published".  I suppose we can,
> at least, encourage developers for the CPU manufacturers to indicate
> in the documentation of preconditions which is more expensive,
> relative to the other if they're unable to indicate the exact costs of
> these misses.
>
These cost are relatively difficult to describe, take strlen on main
memory as example.
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strlen_profile/results_rand_nocache/result.html

Here we see hardware prefetcher in action. A time goes linearly with
size until 512 bytes and remains constant until 4096 bytes(switch to
block view) where it starts increasing at slower rate.
 
For core2 shape is similar except that plateau starts at 256 bytes and
ends at 1024 bytes.
http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strlen_profile/results_rand_nocache/result.html

AMD processors are different, phenomII performance is line, and for fx10
there is even area where time decreases with size.
http://kam.mff.cuni.cz/~ondra/benchmark_string/phenomII/strlen_profile/results_rand_nocache/result.html 
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/strlen_profile/results_rand_nocache/result.html 


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Joseph Myers
In reply to this post by Ryan S. Arnold
On Wed, 4 Sep 2013, Ryan S. Arnold wrote:

> Simply testing for alignment (not presuming aligned data) itself slows
> down the processing of aligned-data, but that's an unavoidable
> reality.  I've chatted with some compiler folks about the possibility
> of branching directly to aligned case labels in string routines if the
> compiler is able to detect aligned data.. but was informed that this
> suggestion might get me burned at the stake.

See the ARM EABI __aeabi_mem* functions.  At present the glibc versions
just wrap or alias the generic ones, so don't take advantage of the extra
alignment information (and the constraints on register clobbers also mean
plain ARM memcpy gets used for them rather than the NEON version).  And
GCC doesn't detect alignment and generate calls to those functions (which
would be a pessimization anyway without glibc implementing these
functions more optimally).  But the principle makes sense, subject to not
pessimizing real use cases by expanding libc code size unduly (different
entry points to functions may be better than duplicating large amounts of
code for each alignment case).

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ondřej Bílka
In reply to this post by Ryan S. Arnold
On Tue, Sep 03, 2013 at 02:34:54PM -0500, Ryan S. Arnold wrote:

> On Tue, Sep 3, 2013 at 12:37 PM, Ondřej Bílka <[hidden email]> wrote:
> >> I disagree strongly. You *must* come up with a measurable answer and
> >> looking at a graph is never a solution I'm going to accept.
> >>
> > You can have that opinion.
> > Looking at performance graphs is most powerful technique how to
> > understand performance. I got most of my improvements from analyzing
> > these.
>
> Are there any open source pipeline analysis tools?  I've found the one
> I've used (proprietary) to be a pretty good indicator of general
> instruction selection optimization/correctness.
>
After bit of googling I found http://marss86.org/~marss86/index.php/Home
Is anyone familiar with it?

This reminded me to work on project in my backlog. Writing tool that can
legaly shuffle assembler instructions.
Then I run simulated annaling to get optimal sheduling for given processor
and benchmark.
With dryrun data this will be close to real performance on small size so
I could get some percents there.
12