What are the use cases for the optimized Q-register memcpy?

wangshuo (AF)
Commit 4a733bf375238a6a595033b5785cea7f27d61307 adds an optimized
Q-register memcpy.
However, I do not see the expected improvement in my environment. This is my test:

test suite: libMicro-0.4.0

./memcpy -E -C 200 -L -S -W -N "memcpy_10"    -s 10   -I 10
./memcpy -E -C 200 -L -S -W -N "memcpy_1k"    -s 1k   -I 50
./memcpy -E -C 200 -L -S -W -N "memcpy_10k"   -s 10k  -I 800
./memcpy -E -C 200 -L -S -W -N "memcpy_1m"    -s 1m   -I 500000
./memcpy -E -C 200 -L -S -W -N "memcpy_10m"   -s 10m  -I 5000000


hardware platform:
Kunpeng-920 @ 2600.0000MHz
L1d cache: 6 MiB
L1i cache: 6 MiB
L2 cache:  48 MiB
L3 cache:  192 MiB

              before this commit (usecs)    after this commit (usecs)
memcpy_10       0.0065                        0.0065
memcpy_1k       0.0299                        0.0294
memcpy_10k      0.2642                        0.2642
memcpy_1m      27.9040                       27.6480
memcpy_10m    265.9840                      274.6880
strlen_10       0.0039                        0.0039
strlen_1k       0.0571                        0.0450

I was wondering if you could give me some advice on my test results.
Re: What are the use cases for the optimized Q-register memcpy?

Szabolcs Nagy-2
The 07/25/2020 09:49, wangshuo (AF) wrote:
> Commit 4a733bf375238a6a595033b5785cea7f27d61307 adds an optimized
> Q-register memcpy.

please add "aarch64" to the email subject if it's
for aarch64 only.

that commit should not alter the kunpeng920 memcpy.
did you change the ifunc logic to use the new one?

the previous commit increases the entry alignment
and this one may move the memcpy to a slightly
different location in libc.so, but that's about it.

> However, I do not see the expected improvement in my environment. This is my test:
>
> test suite: libMicro-0.4.0
>
> ./memcpy -E -C 200 -L -S -W -N "memcpy_10"    -s 10   -I 10
> ./memcpy -E -C 200 -L -S -W -N "memcpy_1k"    -s 1k   -I 50
> ./memcpy -E -C 200 -L -S -W -N "memcpy_10k"   -s 10k  -I 800
> ./memcpy -E -C 200 -L -S -W -N "memcpy_1m"    -s 1m   -I 500000
> ./memcpy -E -C 200 -L -S -W -N "memcpy_10m"   -s 10m  -I 5000000
>
>
> hardware platform:
> Kunpeng-920 @ 2600.0000MHz
> L1d cache: 6 MiB
> L1i cache: 6 MiB
> L2 cache:  48 MiB
> L3 cache:  192 MiB
>
>               before this commit (usecs)    after this commit (usecs)
> memcpy_10       0.0065                        0.0065
> memcpy_1k       0.0299                        0.0294
> memcpy_10k      0.2642                        0.2642
> memcpy_1m      27.9040                       27.6480
> memcpy_10m    265.9840                      274.6880
> strlen_10       0.0039                        0.0039
> strlen_1k       0.0571                        0.0450
>
> I was wondering if you could give me some advice on my test results.

a 3% regression on large copies may be explained by
uarch implementation internals. you can verify
that by keeping the code the same and just adding
some nop padding around memcpy.
Re: What are the use cases for the optimized Q-register memcpy?

Wilco Dijkstra-2
In reply to this post by wangshuo (AF)
Hi Wangshuo,

As Szabolcs said, you didn't run the new GLIBC memcpy. The easiest way to
compare memcpy implementations in GLIBC is to use "make bench". After that
you can run the memcpy benchmarks directly, e.g.:

taskset -c 5 $BUILD_DIR/benchtests/bench-memcpy-random >out.txt

This produces results for the different memcpy implementations so you can see
which works best. I think the new memcpy_simd will work better on Kunpeng
than memcpy_falkor.

Cheers,
Wilco