[PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.


[PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Will Newton

A small change to the entry to the aligned copy loop improves
performance slightly on A9 and A15 cores for certain copies.

ports/ChangeLog.arm:

2013-08-07  Will Newton  <[hidden email]>

        * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
        on entry to aligned copy loop for improved performance.
---
 ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
index 3decad6..6e84173 100644
--- a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
+++ b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
@@ -369,8 +369,8 @@ ENTRY(memcpy)
  cfi_adjust_cfa_offset (FRAME_SIZE)
  cfi_rel_offset (tmp2, 0)
  cfi_remember_state
- and tmp2, src, #3
- and tmp1, dst, #3
+ and tmp2, src, #7
+ and tmp1, dst, #7
  cmp tmp1, tmp2
  bne .Lcpy_notaligned

--
1.8.1.4


Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Will Newton
On 12 August 2013 08:55, Will Newton <[hidden email]> wrote:

>
> A small change to the entry to the aligned copy loop improves
> performance slightly on A9 and A15 cores for certain copies.
>
> ports/ChangeLog.arm:
>
> 2013-08-07  Will Newton  <[hidden email]>
>
>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>         on entry to aligned copy loop for improved performance.
> ---
>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Ping?

--
Will Newton
Toolchain Working Group, Linaro

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Joseph Myers
In reply to this post by Will Newton
On Mon, 12 Aug 2013, Will Newton wrote:

> A small change to the entry to the aligned copy loop improves
> performance slightly on A9 and A15 cores for certain copies.

Could you clarify what you mean by "certain copies"?

In particular, have you verified that for all three choices in this code
(NEON, VFP or neither), the code for unaligned copies is at least as fast
in this case (common 32-bit alignment, but not common 64-bit alignment) as
the code that would previously have been used in those cases?

There are various comments regarding alignment, whether stating "LDRD/STRD
support unaligned word accesses" or referring to the mutual alignment that
applies for particular code.  Does this patch make any of them out of
date?  (If code can now only be reached with common 64-bit alignment, but
in fact requires only 32-bit alignment, the comment should probably state
both those things explicitly.)

--
Joseph S. Myers
[hidden email]

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Will Newton
On 30 August 2013 00:58, Joseph S. Myers <[hidden email]> wrote:

Hi Joseph,

>> A small change to the entry to the aligned copy loop improves
>> performance slightly on A9 and A15 cores for certain copies.
>
> Could you clarify what you mean by "certain copies"?

Large copies (> 16kB) where the buffers are 4-byte aligned but not
8-byte aligned. I'll respin the patch with an improved description.
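The before/after check can be illustrated outside assembly. A minimal Python sketch of the mutual-alignment test, where the masks 0x3 and 0x7 correspond to the `#3` and `#7` immediates in the patch (the addresses are invented for illustration):

```python
def same_mutual_alignment(src: int, dst: int, mask: int) -> bool:
    """True if src and dst have the same offset within a (mask + 1)-byte unit."""
    return (src & mask) == (dst & mask)

# Buffers that are 4-byte aligned but not mutually 8-byte aligned:
src, dst = 0x1004, 0x2008

# Old check (and tmpN, addr, #3): same offset mod 4, so the aligned loop is taken.
print(same_mutual_alignment(src, dst, 0x3))   # True
# New check (and tmpN, addr, #7): offsets differ mod 8, so .Lcpy_notaligned is taken.
print(same_mutual_alignment(src, dst, 0x7))   # False
```

It is exactly this 4-byte-but-not-8-byte-aligned case that now goes down the unaligned path instead of the aligned copy loop.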

> In particular, have you verified that for all three choices in this code
> (NEON, VFP or neither), the code for unaligned copies is at least as fast
> in this case (common 32-bit alignment, but not common 64-bit alignment) as
> the code that would previously have been used in those cases?

Yes, the performance is very similar but slightly better in the NEON
case and approximately unchanged in the others.

> There are various comments regarding alignment, whether stating "LDRD/STRD
> support unaligned word accesses" or referring to the mutual alignment that
> applies for particular code.  Does this patch make any of them out of
> date?  (If code can now only be reached with common 64-bit alignment, but
> in fact requires only 32-bit alignment, the comment should probably state
> both those things explicitly.)

I've reviewed the comments and they all look ok as far as I can tell.

--
Will Newton
Toolchain Working Group, Linaro

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Joseph Myers
On Fri, 30 Aug 2013, Will Newton wrote:

> > There are various comments regarding alignment, whether stating "LDRD/STRD
> > support unaligned word accesses" or referring to the mutual alignment that
> > applies for particular code.  Does this patch make any of them out of
> > date?  (If code can now only be reached with common 64-bit alignment, but
> > in fact requires only 32-bit alignment, the comment should probably state
> > both those things explicitly.)
>
> I've reviewed the comments and they all look ok as far as I can tell.

Are you sure?  For example, where it says "SRC and DST have the same
mutual 32-bit alignment", is it not in fact the case after the patch that
they now have the same mutual 64-bit alignment, even if this code doesn't
currently rely on that?  I think the comments in each place should be
explicit about both things - what preconditions the code relies on, and
what possibly stronger preconditions are in fact true given the other code
in this file.  Saying the mutual alignment is 32-bit when the reader can
see from the code not far above that it's 64-bit just seems liable to
confuse the reader, even if the comment is still formally true.  
Similarly for the requirements on unaligned word accesses - after this
patch, which uses of ldrd/strd do require that?

--
Joseph S. Myers
[hidden email]

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Carlos O'Donell-6
In reply to this post by Will Newton
On 08/27/2013 03:46 AM, Will Newton wrote:

> On 12 August 2013 08:55, Will Newton <[hidden email]> wrote:
>>
>> A small change to the entry to the aligned copy loop improves
>> performance slightly on A9 and A15 cores for certain copies.
>>
>> ports/ChangeLog.arm:
>>
>> 2013-08-07  Will Newton  <[hidden email]>
>>
>>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>         on entry to aligned copy loop for improved performance.
>> ---
>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Ping?

How did you test the performance?

glibc has a performance microbenchmark, did you use that?

Cheers,
Carlos.


Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Will Newton
In reply to this post by Joseph Myers
On 30 August 2013 16:18, Joseph S. Myers <[hidden email]> wrote:

> On Fri, 30 Aug 2013, Will Newton wrote:
>
>> > There are various comments regarding alignment, whether stating "LDRD/STRD
>> > support unaligned word accesses" or referring to the mutual alignment that
>> > applies for particular code.  Does this patch make any of them out of
>> > date?  (If code can now only be reached with common 64-bit alignment, but
>> > in fact requires only 32-bit alignment, the comment should probably state
>> > both those things explicitly.)
>>
>> I've reviewed the comments and they all look ok as far as I can tell.
>
> Are you sure?  For example, where it says "SRC and DST have the same
> mutual 32-bit alignment", is it not in fact the case after the patch that
> they now have the same mutual 64-bit alignment, even if this code doesn't
> currently rely on that?  I think the comments in each place should be
> explicit about both things - what preconditions the code relies on, and
> what possibly stronger preconditions are in fact true given the other code
> in this file.  Saying the mutual alignment is 32-bit when the reader can
> see from the code not far above that it's 64-bit just seems liable to
> confuse the reader, even if the comment is still formally true.
> Similarly for the requirements on unaligned word accesses - after this
> patch, which uses of ldrd/strd do require that?

Yes, you're right, that needs more thought. I'll have a look at it next week.

--
Will Newton
Toolchain Working Group, Linaro

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Will Newton
In reply to this post by Carlos O'Donell-6
On 30 August 2013 18:14, Carlos O'Donell <[hidden email]> wrote:

Hi Carlos,

>>> A small change to the entry to the aligned copy loop improves
>>> performance slightly on A9 and A15 cores for certain copies.
>>>
>>> ports/ChangeLog.arm:
>>>
>>> 2013-08-07  Will Newton  <[hidden email]>
>>>
>>>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>         on entry to aligned copy loop for improved performance.
>>> ---
>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> Ping?
>
> How did you test the performance?
>
> glibc has a performance microbenchmark, did you use that?

No, I used the cortex-strings package developed by Linaro for
benchmarking various string functions against one another[1].

I haven't checked the glibc benchmarks but I'll look into that. It's
quite a specific case that shows the problem so it may not be obvious
which one is better however.

[1] https://launchpad.net/cortex-strings

--
Will Newton
Toolchain Working Group, Linaro

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Carlos O'Donell-6
On 08/30/2013 02:48 PM, Will Newton wrote:

> On 30 August 2013 18:14, Carlos O'Donell <[hidden email]> wrote:
>
> Hi Carlos,
>
>>>> A small change to the entry to the aligned copy loop improves
>>>> performance slightly on A9 and A15 cores for certain copies.
>>>>
>>>> ports/ChangeLog.arm:
>>>>
>>>> 2013-08-07  Will Newton  <[hidden email]>
>>>>
>>>>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>>         on entry to aligned copy loop for improved performance.
>>>> ---
>>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> Ping?
>>
>> How did you test the performance?
>>
>> glibc has a performance microbenchmark, did you use that?
>
> No, I used the cortex-strings package developed by Linaro for
> benchmarking various string functions against one another[1].
>
> I haven't checked the glibc benchmarks but I'll look into that. It's
> quite a specific case that shows the problem so it may not be obvious
> which one is better however.

If it's not obvious, how is someone supposed to review this patch? :-)

> [1] https://launchpad.net/cortex-strings

There are 2 benchmarks. One appears to be Dhrystone 2.1, which isn't a string
test in and of itself and should not be used for benchmarking or changing
string functions. The other is called "multi" and appears to run some functions
in a loop and take the time.

e.g.
http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c

I would not call `multi' exhaustive, and while neither are the glibc
performance benchmark tests, the glibc tests have received review from the
glibc community and are our preferred way of demonstrating performance gains
when posting performance patches.

I would really really like to see you post the results of running your new
implementation with this benchmark and show the numbers that claim this is
faster. Is that possible?

Cheers,
Carlos.
 


Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Will Newton
On 30 August 2013 20:26, Carlos O'Donell <[hidden email]> wrote:

> On 08/30/2013 02:48 PM, Will Newton wrote:
>> On 30 August 2013 18:14, Carlos O'Donell <[hidden email]> wrote:
>>
>> Hi Carlos,
>>
>>>>> A small change to the entry to the aligned copy loop improves
>>>>> performance slightly on A9 and A15 cores for certain copies.
>>>>>
>>>>> ports/ChangeLog.arm:
>>>>>
>>>>> 2013-08-07  Will Newton  <[hidden email]>
>>>>>
>>>>>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>>>         on entry to aligned copy loop for improved performance.
>>>>> ---
>>>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> Ping?
>>>
>>> How did you test the performance?
>>>
>>> glibc has a performance microbenchmark, did you use that?
>>
>> No, I used the cortex-strings package developed by Linaro for
>> benchmarking various string functions against one another[1].
>>
>> I haven't checked the glibc benchmarks but I'll look into that. It's
>> quite a specific case that shows the problem so it may not be obvious
>> which one is better however.
>
> If it's not obvious how is someone supposed to review this patch? :-)
>
>> [1] https://launchpad.net/cortex-strings
>
> There are 2 benchmarks. One appears to be Dhrystone 2.1, which isn't a string
> test in and of itself and should not be used for benchmarking or changing
> string functions. The other is called "multi" and appears to run some functions
> in a loop and take the time.
>
> e.g.
> http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
>
> I would not call `multi' exhaustive, and while neither are the glibc
> performance benchmark tests, the glibc tests have received review from the
> glibc community and are our preferred way of demonstrating performance gains
> when posting performance patches.
>
> I would really really like to see you post the results of running your new
> implementation with this benchmark and show the numbers that claim this is
> faster. Is that possible?

The mailing list server does not seem to accept image attachments so I
have uploaded the performance graph here:

http://people.linaro.org/~will.newton/glibc_memcpy/sizes-memcpy-08-04-2.5.png

--
Will Newton
Toolchain Working Group, Linaro

benchmark improvements (Was: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.)

Siddhesh Poyarekar-3
In reply to this post by Carlos O'Donell-6
Adding libc-alpha since my comments on this are more suitable there -
maybe we should just get rid of the ports mailing list altogether.

On Mon, Sep 02, 2013 at 02:58:23PM +0100, Will Newton wrote:
> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.

We already depend on gd-devel to draw graphs for memusagestat.  Maybe
we could leverage that in the microbenchmark as well and come up with
html reports rather than just a bench.out?  Graphs in general are a
lot more readable and I agree that they would be the next logical
improvement to the benchmark framework.

> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.

I know Ondrej had proposed a few improvements as well.  I'd like to
see those reposted so that we can look at it and if possible, have
them merged in.

Siddhesh

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ondřej Bílka
In reply to this post by Carlos O'Donell-6
On Mon, Sep 02, 2013 at 02:58:23PM +0100, Will Newton wrote:

> On 30 August 2013 20:26, Carlos O'Donell <[hidden email]> wrote:
> > On 08/30/2013 02:48 PM, Will Newton wrote:
> >> On 30 August 2013 18:14, Carlos O'Donell <[hidden email]> wrote:
> >>>> Ping?
> >>>
> >>> How did you test the performance?
> >>>
> >>> glibc has a performance microbenchmark, did you use that?
> >>
> >> No, I used the cortex-strings package developed by Linaro for
> >> benchmarking various string functions against one another[1].
> >>
> >> I haven't checked the glibc benchmarks but I'll look into that. It's
> >> quite a specific case that shows the problem so it may not be obvious
> >> which one is better however.
> >
> > If it's not obvious how is someone supposed to review this patch? :-)
>
> With difficulty. ;-)
>
> Joseph has raised some good points about the comments and I'll go back
> through the code and make sure everything is correct in that regard.
> The change was actually made to the copy of the code in cortex-strings
> some time ago but I delayed pushing the patch due to the 2.18 release
> so I have to refresh my memory somewhat.
>
> Ideally we would have an agreed upon benchmark with which everyone
> could analyse the performance of the code on their systems, however
> that does not seem to exist as far as I can tell.
>
Well, for measuring performance, about the only method that everybody will
agree with is to compile the implementations as old.so and new.so and then run

LD_PRELOAD=old.so time cmd
LD_PRELOAD=new.so time cmd

in a loop until you can show a statistically significant difference
(provided that the commands you use are representative enough).

With any other method somebody will argue that the opposite holds, because
you forgot to take some factor into account.

Even instrumenting the old.so implementation to accurately measure the time
spent in the function need not be enough.
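The stopping rule above — repeat both runs until the difference is statistically significant — can be sketched in Python; the timing samples here are invented, and a real comparison would feed in the collected `time` outputs:

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples of run times."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

old = [1.52, 1.49, 1.51, 1.50, 1.53]   # seconds, LD_PRELOAD=old.so
new = [1.41, 1.43, 1.40, 1.42, 1.41]   # seconds, LD_PRELOAD=new.so

t = welch_t(old, new)
# Crude decision: |t| well above ~2 for samples this size suggests the
# difference is real; otherwise, keep collecting runs.
print(abs(t) > 2.0)
```

This only decides whether the measured difference is real, not whether the command was representative, which is Ondřej's larger point.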

You could have an implementation that is 5 cycles faster on that
benchmark but slower in reality because:

1) Its code is 1000 bytes bigger than the alternative. The gains in the
function itself will be eaten by instruction cache misses outside the
function.
Or
2) The function aggressively prefetches data (say a loop that prefetches
lines 512 bytes past the current buffer position). This makes the
benchmark numbers better, but the cache gets littered with data from
beyond the end of the buffer, so real performance suffers.
Or
3) For malloc, saving metadata in the same cache line as the start of the
allocated memory could make a benchmark look bad due to cache misses. But
it improves real performance, because the user will write there and the
metadata write serves as a prefetch.
or
...


> > e.g.
> > http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
> >
> > I would not call `multi' exhaustive, and while neither is the glibc performance
> > benchmark tests the glibc tests have received review from the glibc community
> > and are our preferred way of demonstrating performance gains when posting
> > performance patches.
>
> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.
>
Like the following? http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_rand/result.html

On real workloads of memcpy it is still a bit hard to see what is going on:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_gcc/result.html


> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.
>
How will you fix measuring in a tight loop, with the same arguments, and only 32 times?
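The complaint can be made concrete. A harness that hammers one fixed (buffer, size) pair keeps everything cache-hot and branch-predicted; a sketch like the following (the particular alignments and sizes are invented, not glibc's) instead walks a grid of parameters so successive calls see different conditions:

```python
import itertools

def bench_params(alignments=(0, 1, 4, 8, 15), sizes=(15, 64, 512, 4096), runs=4):
    """Yield (alignment, size) pairs, cycling through the whole grid `runs`
    times, instead of repeating one fixed argument pair 32 times."""
    for _ in range(runs):
        for align, size in itertools.product(alignments, sizes):
            yield align, size

params = list(bench_params())
print(len(params))        # 5 alignments * 4 sizes * 4 runs = 80 calls
print(len(set(params)))   # 20 distinct (alignment, size) conditions
```

Interleaving the conditions, rather than running each one in its own tight block, also stops the predictor from learning a single case.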


Re: benchmark improvements (Was: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.)

Will Newton
In reply to this post by Siddhesh Poyarekar-3
On 2 September 2013 15:20, Siddhesh Poyarekar <[hidden email]> wrote:

> Adding libc-alpha since my comments on this are more suitable there -
> maybe we should just get rid of the ports mailing list altogether.
>
> On Mon, Sep 02, 2013 at 02:58:23PM +0100, Will Newton wrote:
>> The key advantage of the cortex-strings framework is that it allows
>> graphing the results of benchmarks. Often changes to string function
>> performance can only really be analysed graphically as otherwise you
>> end up with a huge soup of numbers, some going up, some going down and
>> it is very hard to separate the signal from the noise.
>
> We already depend on gd-devel to draw graphs for memusagestat.  Maybe
> we could leverage that in the microbenchmark as well and come up with
> html reports rather than just a bench.out?  Graphs in general are a
> lot more readable and I agree that they would be the next logical
> improvement to the benchmark framework.

gd is quite basic from a graph plotting point of view, it's more of a
low-level graphics library and requires wrappers to call from
scripting languages. I already have some code I can cannibalize from
cortex-strings that uses Python and Pylab to do the plotting, I'll
submit a patch for review soon.

>> The glibc benchmarks also have some other weaknesses that should
>> really be addressed, hopefully I'll have some time to write patches
>> for some of this work.
>
> I know Ondrej had proposed a few improvements as well.  I'd like to
> see those reposted so that we can look at it and if possible, have
> them merged in.

I already have a patch to do multiple runs of benchmarks  - some
things like physical page allocation that can impact a benchmark can
only be controlled for this way. As I mentioned above I'd also like to
get graphing capability in there too. Beyond that it would be nice to
have a look at the various sizes and alignments used and make sure
there is a reasonably complete set, and to make sure the tests are run
for a useful number of iterations (not too large or too small).
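A rough sketch of the multiple-runs idea, with invented numbers: summarize repeated whole-benchmark runs by the median, so that one unlucky physical-page layout does not dominate the reported figure:

```python
import statistics

def summarize(run_times):
    """Collapse repeated benchmark runs into a noise-resistant summary."""
    return {
        "median": statistics.median(run_times),
        "min": min(run_times),
        "spread": max(run_times) - min(run_times),
    }

# e.g. five fresh-process runs of the same memcpy benchmark, in seconds;
# the third run is assumed to have hit bad page placement.
runs = [0.231, 0.228, 0.284, 0.230, 0.229]
print(summarize(runs)["median"])   # the outlier barely moves the median
```

A large spread relative to the median is itself a useful signal that the iteration count or run count is too small.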


--
Will Newton
Toolchain Working Group, Linaro

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Carlos O'Donell-6
In reply to this post by Will Newton
On 09/02/2013 10:18 AM, Will Newton wrote:
> The mailing list server does not seem to accept image attachments so I
> have uploaded the performance graph here:
>
> http://people.linaro.org/~will.newton/glibc_memcpy/sizes-memcpy-08-04-2.5.png

Excellent graph.

Cheers,
Carlos.


Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Carlos O'Donell-6
In reply to this post by Carlos O'Donell-6
On 09/02/2013 09:58 AM, Will Newton wrote:
>> If it's not obvious how is someone supposed to review this patch? :-)
>
> With difficulty. ;-)

Thank you for acknowledging that.

> Joseph has raised some good points about the comments and I'll go back
> through the code and make sure everything is correct in that regard.
> The change was actually made to the copy of the code in cortex-strings
> some time ago but I delayed pushing the patch due to the 2.18 release
> so I have to refresh my memory somewhat.
>
> Ideally we would have an agreed upon benchmark with which everyone
> could analyse the performance of the code on their systems, however
> that does not seem to exist as far as I can tell.

We have one, it's the glibc microbenchmark, and we want to expand it,
otherwise when ACME comes with their patch for ARM and breaks performance
for targets that Linaro cares about I have no way to reject the patch
objectively :-)

> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.

I disagree strongly. You *must* come up with a measurable answer and
looking at a graph is never a solution I'm going to accept.

You need to statistically analyze the numbers, assign weights to ranges,
and come up with some kind of number that evaluates the results based
on *some* formula. That is the only way we are going to keep moving
performance forward (against some kind of criteria).
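As a hedged sketch of that idea — the buckets, weights, and timings here are all invented for illustration — an implementation can be scored as a weighted sum over (size, alignment) ranges, yielding one comparable number per patch:

```python
def score(timings, weights):
    """Weighted score over benchmark buckets; lower is better."""
    assert set(timings) == set(weights)
    return sum(timings[bucket] * weights[bucket] for bucket in timings)

# Weights encode which ranges the target workload cares about.
weights = {("small", 8): 0.5, ("large", 8): 0.3, ("large", 4): 0.2}
# Normalized timings per bucket for the old and new implementations.
old = {("small", 8): 1.0, ("large", 8): 1.0, ("large", 4): 1.0}
new = {("small", 8): 1.0, ("large", 8): 1.0, ("large", 4): 0.8}  # faster 4-byte case

print(score(new, weights) < score(old, weights))  # an objective accept signal
```

The hard part, as the rest of the thread argues, is choosing the weights; the formula itself is trivial.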

> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.

Thank you very much.

Cheers,
Carlos.

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ondřej Bílka
On Tue, Sep 03, 2013 at 12:18:24PM -0400, Carlos O'Donell wrote:

> On 09/02/2013 09:58 AM, Will Newton wrote:
> >> If it's not obvious how is someone supposed to review this patch? :-)
> >
> > With difficulty. ;-)
>
> Thank you for acknowledging that.
>
> > Joseph has raised some good points about the comments and I'll go back
> > through the code and make sure everything is correct in that regard.
> > The change was actually made to the copy of the code in cortex-strings
> > some time ago but I delayed pushing the patch due to the 2.18 release
> > so I have to refresh my memory somewhat.
> >
> > Ideally we would have an agreed upon benchmark with which everyone
> > could analyse the performance of the code on their systems, however
> > that does not seem to exist as far as I can tell.
>
> We have one, it's the glibc microbenchmark, and we want to expand it,
> otherwise when ACME comes with their patch for ARM and breaks performance
> for targets that Linaro cares about I have no way to reject the patch
> objectively :-)
>
Carlos, you are asking for the impossible. When you publish a benchmark,
people will try to maximize the benchmark number. After a certain point
this becomes possible only by employing shady accounting: move part of
the time to a place where it will not be measured by the benchmark (for
example by having a function that is 4kB large; in benchmarks it will fit
into the instruction cache, but that does not happen in reality).

Taking care of the common factors that can cause this is about ten times
more complex than whole-system benchmarking, and the analysis will be
quite difficult, as you will get twenty numbers and will need to decide
which ones could make a real impact and which won't.
 
> > The key advantage of the cortex-strings framework is that it allows
> > graphing the results of benchmarks. Often changes to string function
> > performance can only really be analysed graphically as otherwise you
> > end up with a huge soup of numbers, some going up, some going down and
> > it is very hard to separate the signal from the noise.
>
> I disagree strongly. You *must* come up with a measurable answer and
> looking at a graph is never a solution I'm going to accept.
>
You can have that opinion.
Looking at performance graphs is the most powerful technique for
understanding performance. I got most of my improvements from analyzing
them.

> You need to statistically analyze the numbers, assign weights to ranges,
> and come up with some kind of number that evaluates the results based
> on *some* formula. That is the only way we are going to keep moving
> performance forward (against some kind of criteria).
>
Accurately assigning those weights is best done by taking a program,
running it, and measuring time. Without taking this into account the
weights will not tell you much, as you will likely just optimize cold
code at the expense of hot code.

> > The glibc benchmarks also have some other weaknesses that should
> > really be addressed, hopefully I'll have some time to write patches
> > for some of this work.
>
> Thank you very much.
>
> Cheers,
> Carlos.


Re: benchmark improvements (Was: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.)

Ondřej Bílka
In reply to this post by Will Newton
On Tue, Sep 03, 2013 at 02:46:13PM +0100, Will Newton wrote:

> On 2 September 2013 15:20, Siddhesh Poyarekar <[hidden email]> wrote:
> >> The glibc benchmarks also have some other weaknesses that should
> >> really be addressed, hopefully I'll have some time to write patches
> >> for some of this work.
> >
> > I know Ondrej had proposed a few improvements as well.  I'd like to
> > see those reposted so that we can look at it and if possible, have
> > them merged in.
>
> I already have a patch to do multiple runs of benchmarks  - some
> things like physical page allocation that can impact a benchmark can
> only be controlled for this way. As I mentioned above I'd also like to
> get graphing capability in there too. Beyond that it would be nice to
> have a look at the various sizes and alignments used and make sure
> there is a reasonably complete set, and to make sure the tests are run
> for a useful number of iterations (not too large or too small).
>
For alignments, do you want the existing implementation to take

a) 0.031s b) 0.080s c) 0.036s

If you want to get your implementation accepted pick a); if you do not
like the ACME implementation pick b); otherwise pick c).

I got those numbers by 'benchmarking' memchr with alignment 15 and size
15 on Ivy Bridge (benchmark attached).

The current memchr implementation has separate branches for loads that
cross a cache line and those that don't. For a) the addresses are of the
form 64*x+15, for b) 64*x+63, and for c) 16*x+15.



[Attachment: memchr.tar.bz2 (8K)]

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Carlos O'Donell-6
In reply to this post by Ondřej Bílka
On 09/03/2013 01:37 PM, Ondřej Bílka wrote:

>> We have one, it's the glibc microbenchmark, and we want to expand it,
>> otherwise when ACME comes with their patch for ARM and breaks performance
>> for targets that Linaro cares about I have no way to reject the patch
>> objectively :-)
>>
> Carlos, you are asking for the impossible. When you publish a benchmark,
> people will try to maximize the benchmark number. After a certain point
> this becomes possible only by employing shady accounting: move part of
> the time to a place where it will not be measured by the benchmark (for
> example by having a function that is 4kB large; in benchmarks it will fit
> into the instruction cache, but that does not happen in reality).

What is it that I'm asking that is impossible?

> Taking care of the common factors that can cause this is about ten times
> more complex than whole-system benchmarking, and the analysis will be
> quite difficult, as you will get twenty numbers and will need to decide
> which ones could make a real impact and which won't.

Sorry, could you clarify this a bit more, exactly what is ten times
more complex?

If we have N tests and they produce N numbers, for a given target,
for a given device, for a given workload, there is a set of importance
weights on N that should give you some kind of relevance.

We should be able to come up with some kind of framework from which
we can clearly say "this patch is better than this other patch", even
if not automated, it should be possible to reason from the results,
and that reasoning recorded as a discussion on this list.

>>> The key advantage of the cortex-strings framework is that it allows
>>> graphing the results of benchmarks. Often changes to string function
>>> performance can only really be analysed graphically as otherwise you
>>> end up with a huge soup of numbers, some going up, some going down and
>>> it is very hard to separate the signal from the noise.
>>
>> I disagree strongly. You *must* come up with a measurable answer and
>> looking at a graph is never a solution I'm going to accept.
>>
> You can have that opinion.
> Looking at performance graphs is the most powerful technique for
> understanding performance. I got most of my improvements from analyzing
> these.

That is a different use for the graphs. I do not disagree that graphing
is a *powerful* way to display information and using that information to
produce a new routine is useful. What I disagree with is using such graphs
to argue qualitatively that your patch is better than the existing
implementation.

There is always a quantitative way to say X is better than Y, but it
requires breaking down your expectations and documenting them e.g.
should be faster with X alignment on sizes from N bytes to M bytes, and
then ranking based on those criteria.
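
As a sketch of that kind of criteria-based ranking (the buckets and
timings below are assumed, not measured):

```python
# Hypothetical sketch: break results into (alignment, size-range)
# buckets and give a per-bucket verdict, instead of eyeballing a graph.

# mean time in ns per (alignment, size-range) bucket -- invented numbers
X = {("aligned", "1-64B"): 8.0, ("unaligned", "64B-4K"): 150.0}
Y = {("aligned", "1-64B"): 9.0, ("unaligned", "64B-4K"): 140.0}

def compare(x, y):
    """Return, for each bucket, which implementation is faster."""
    return {bucket: ("X" if x[bucket] < y[bucket] else "Y") for bucket in x}

verdicts = compare(X, Y)
```

Here X wins on small aligned copies and Y on larger unaligned ones; the
ranking is then an explicit, documented statement per criterion rather
than a visual impression.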

>> You need to statistically analyze the numbers, assign weights to ranges,
>> and come up with some kind of number that evaluates the results based
>> on *some* formula. That is the only way we are going to keep moving
>> performance forward (against some kind of criteria).
>>
> Accurately assigning weights is best done by taking a program, running
> it and measuring time. Without taking this into account the weights will
> not tell us much, as you will likely just optimize cold code at the
> expense of hot code.

I don't disagree with you here.

Cheers,
Carlos.

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Ondřej Bílka
On Tue, Sep 03, 2013 at 01:52:34PM -0400, Carlos O'Donell wrote:

> On 09/03/2013 01:37 PM, Ondřej Bílka wrote:
> >> We have one, it's the glibc microbenchmark, and we want to expand it,
> >> otherwise when ACME comes with their patch for ARM and breaks performance
> >> for targets that Linaro cares about I have no way to reject the patch
> >> objectively :-)
> >>
> > Carlos, you are asking for the impossible. When you publish a benchmark,
> > people will try to maximize the benchmark number. After a certain point
> > this becomes possible only by employing shady accounting: moving part of
> > the time to a place where it will not be measured by the benchmark (for
> > example by having a function that is 4kb large; in benchmarks it will
> > fit into the instruction cache, but that does not happen in reality).
>
> What is it that I'm asking that is impossible?
>
Having a static set of benchmarks that can say whether an implementation
is an improvement.

We are shooting at a moving target: architectures change, the code we
write will reach users with considerable delay, and the factors we used
for the decision will have changed in the meantime.

Once an implementation reaches a certain quality, the question of what is
better becomes dependent on the program used. Until we can decide from
profile feedback, we will lose some percent by having to use a single
implementation.
 
> > Taking care of the common factors that can cause that is about ten times
> > more complex than whole-system benchmarking, and the analysis will be
> > quite difficult: you will get twenty numbers and you will need to decide
> > which ones could make a real impact and which won't.
>
> Sorry, could you clarify this a bit more, exactly what is ten times
> more complex?
>
Having a benchmark suite that will catch all the relevant factors that
can affect performance. Some are hard to quantify; for them we need to
know how an average program stresses resources.

Take instruction cache usage: a function will occupy cache lines, and we
can accurately measure the probability and cost of cache misses inside
the function. What is hard to estimate is how this will affect the rest
of the program. For that we would need to know the average probability
that a cache line will be referenced in the future.
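
A rough model of that estimation problem can be sketched as follows (the
line size, reuse probability and miss cost below are assumptions for
illustration, not measurements):

```python
# Hypothetical model: a function's i-cache footprint is easy to compute,
# but the cost it imposes on the rest of the program depends on the
# probability that an evicted line would have been reused -- the very
# number that is hard to know without whole-program data.

LINE = 64  # bytes per cache line (typical for ARMv7 cores)

def icache_lines(func_size_bytes):
    """Number of cache lines a function's code occupies."""
    return -(-func_size_bytes // LINE)  # ceiling division

def eviction_cost(func_size_bytes, p_reuse, miss_ns):
    """Expected extra miss cost elsewhere in the program, given an
    assumed probability p_reuse that an evicted line is reused."""
    return icache_lines(func_size_bytes) * p_reuse * miss_ns

# A 4kb function occupies 64 lines; a microbenchmark that keeps it
# resident in cache never observes this cost.
```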
 
> If we have N tests and they produce N numbers, for a given target,
> for a given device, for a given workload, there is a set of importance
> weights on N that should give you some kind of relevance.
>
You are jumping to the case where we already have these weights. The
problematic part is getting them.

> We should be able to come up with some kind of framework from which
> we can clearly say "this patch is better than this other patch", even
> if not automated, it should be possible to reason from the results,
> and that reasoning recorded as a discussion on this list.
>
What is possible is to say that some patch is significantly worse based
on some criteria. There is a lot of gray area where the decision is
unclear.

> >>> The key advantage of the cortex-strings framework is that it allows
> >>> graphing the results of benchmarks. Often changes to string function
> >>> performance can only really be analysed graphically as otherwise you
> >>> end up with a huge soup of numbers, some going up, some going down and
> >>> it is very hard to separate the signal from the noise.
> >>
> >> I disagree strongly. You *must* come up with a measurable answer and
> >> looking at a graph is never a solution I'm going to accept.
> >>
> > You can have that opinion.
> > Looking at performance graphs is the most powerful technique for
> > understanding performance. I got most of my improvements from analyzing
> > these.
>
> That is a different use for the graphs. I do not disagree that graphing
> is a *powerful* way to display information and using that information to
> produce a new routine is useful. What I disagree with is using such graphs
> to argue qualitatively that your patch is better than the existing
> implementation.
>
> There is always a quantitative way to say X is better than Y, but it
> requires breaking down your expectations and documenting them e.g.
> should be faster with X alignment on sizes from N bytes to M bytes, and
> then ranking based on those criteria.
>
> >> You need to statistically analyze the numbers, assign weights to ranges,
> >> and come up with some kind of number that evaluates the results based
> >> on *some* formula. That is the only way we are going to keep moving
> >> performance forward (against some kind of criteria).
> >>
> > Accurately assigning weights is best done by taking a program, running
> > it and measuring time. Without taking this into account the weights will
> > not tell us much, as you will likely just optimize cold code at the
> > expense of hot code.
>
> I don't disagree with you here.
>
> Cheers,
> Carlos.


Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Carlos O'Donell-6
On 09/03/2013 02:57 PM, Ondřej Bílka wrote:

> On Tue, Sep 03, 2013 at 01:52:34PM -0400, Carlos O'Donell wrote:
>> On 09/03/2013 01:37 PM, Ondřej Bílka wrote:
>>>> We have one, it's the glibc microbenchmark, and we want to expand it,
>>>> otherwise when ACME comes with their patch for ARM and breaks performance
>>>> for targets that Linaro cares about I have no way to reject the patch
>>>> objectively :-)
>>>>
>>> Carlos, you are asking for the impossible. When you publish a benchmark,
>>> people will try to maximize the benchmark number. After a certain point
>>> this becomes possible only by employing shady accounting: moving part of
>>> the time to a place where it will not be measured by the benchmark (for
>>> example by having a function that is 4kb large; in benchmarks it will
>>> fit into the instruction cache, but that does not happen in reality).
>>
>> What is it that I'm asking that is impossible?
>>
> Having a static set of benchmarks that can say whether an implementation
> is an improvement.

I agree that a static set of benchmarks will not provide us with an
accurate answer for "Is this patch objectively better for performance?"

I don't see that that makes it impossible. The static set of benchmarks
at a given point in time (since we should be adding new benchmarks) may
have some error with respect to "fastest" for a given ISA/device/workload
(which I will shorten as `workload' from now on). Therefore I would not
say it's impossible.

It's probably impossible for the error between the estimator and reality
to be close to zero, though. That's just expected.

> We are shooting at a moving target: architectures change, the code we
> write will reach users with considerable delay, and the factors we used
> for the decision will have changed in the meantime.

What's wrong with a moving target?

I have spoken with CPU designers and I've been told that they do whole
system profiling in order to assist in making instruction-to-microcode
decoding decisions. Therefore what we select as optimal sequences are
also fed forward into *new* architecture designs.

> Once an implementation reaches a certain quality, the question of what is
> better becomes dependent on the program used. Until we can decide from
> profile feedback, we will lose some percent by having to use a single
> implementation.

I agree. The eventual goal of the project is to have some kind of
whole system benchmarking that allows users to feed in their profiles
and allow us as developers to see what users are doing with our library.

Just like CPU designers feed in a whole distribution of applications
and look at the probability of instruction selection and tweak instruction
to microcode mappings.

I am willing to accept a certain error in the process as long as I know
we are headed in the right direction. If we all disagree about the
direction we are going in then we should talk about it.

I see:

microbenchmarks -> whole system benchmarks -> profile driven optimizations

With the latter driving the set of tunnables we expose in the library,
and possibly even the functions selected by the IFUNC resolvers at
program startup.
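
As a sketch of how the end of that pipeline could look (implementation
names, timings and profile shares are invented; a real resolver would be
an IFUNC in C, this only illustrates the selection logic):

```python
# Hypothetical sketch: microbenchmark results plus a workload profile
# pick one implementation, much as an IFUNC resolver picks a routine
# at program startup.

# per-implementation mean times (ns) per size class -- invented numbers
bench = {
    "memcpy_neon": {"small": 9.0,  "large": 100.0},
    "memcpy_vfp":  {"small": 11.0, "large": 90.0},
}

def resolve(profile):
    """Pick the implementation with the lowest profile-weighted time."""
    def cost(impl):
        return sum(bench[impl][cls] * share for cls, share in profile.items())
    return min(bench, key=cost)

# A workload dominated by small copies selects the small-copy
# specialist; a large-copy workload selects the other routine.
```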
 

>>> Taking care of the common factors that can cause that is about ten times
>>> more complex than whole-system benchmarking, and the analysis will be
>>> quite difficult: you will get twenty numbers and you will need to decide
>>> which ones could make a real impact and which won't.
>>
>> Sorry, could you clarify this a bit more, exactly what is ten times
>> more complex?
>>
> Having a benchmark suite that will catch all the relevant factors that
> can affect performance. Some are hard to quantify; for them we need to
> know how an average program stresses resources.

I agree.

I would be happy to accept a patch that does:
* Shows the benchmark numbers.
* Explains relevant factors not caught by the benchmark that affect
  performance, what they are, and why the patch should go in.

My goal is to increase the quality of the written rationales for
performance related submissions.

> Take instruction cache usage: a function will occupy cache lines, and we
> can accurately measure the probability and cost of cache misses inside
> the function. What is hard to estimate is how this will affect the rest
> of the program. For that we would need to know the average probability
> that a cache line will be referenced in the future.

Good example.

>> If we have N tests and they produce N numbers, for a given target,
>> for a given device, for a given workload, there is a set of importance
>> weights on N that should give you some kind of relevance.
>>
> You are jumping to the case where we already have these weights. The
> problematic part is getting them.

I agree.

It's hard to know the weights without having an intuitive understanding
of the applications you're running on your system and what's relevant
for their performance.

Perhaps my example of a weighted average is too abstract to use today.

>> We should be able to come up with some kind of framework from which
>> we can clearly say "this patch is better than this other patch", even
>> if not automated, it should be possible to reason from the results,
>> and that reasoning recorded as a discussion on this list.
>>
> What is possible is to say that some patch is significantly worse based
> on some criteria. There is a lot of gray area where the decision is
> unclear.

:-)

Cheers,
Carlos.