This commit 4a733bf375238a6a595033b5785cea7f27d61307 adds an optimized
Q-register memcpy. However, I cannot get ideal results in my environment. This is my test:

test suite: libMicro-0.4.0

./memcpy -E -C 200 -L -S -W -N "memcpy_10" -s 10 -I 10
./memcpy -E -C 200 -L -S -W -N "memcpy_1k" -s 1k -I 50
./memcpy -E -C 200 -L -S -W -N "memcpy_10k" -s 10k -I 800
./memcpy -E -C 200 -L -S -W -N "memcpy_1m" -s 1m -I 500000
./memcpy -E -C 200 -L -S -W -N "memcpy_10m" -s 10m -I 5000000

hardware platform:
Kunpeng-920 @ 2600.0000MHz
L1d cache: 6 MiB
L1i cache: 6 MiB
L2 cache: 48 MiB
L3 cache: 192 MiB

             before this commit (usecs)   after this commit (usecs)
memcpy_10      0.0065                       0.0065
memcpy_1k      0.0299                       0.0294
memcpy_10k     0.2642                       0.2642
memcpy_1m     27.9040                      27.6480
memcpy_10m   265.9840                     274.6880
strlen_10      0.0039                       0.0039
strlen_1k      0.0571                       0.0450

I was wondering if you could give me some advice about my test results.
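For readers who want to reproduce this kind of measurement without libMicro, the following is a minimal sketch of a per-call memcpy timer. It is not libMicro's method: the function name `time_memcpy_usecs`, the use of `clock_gettime`, and the simple averaging are all assumptions made for illustration. Note that it times whichever memcpy the dynamic linker resolves at run time, so it only measures a new implementation if the program actually runs against the newly built libc.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not part of libMicro or glibc): returns the mean
   time of one memcpy call of `size` bytes in microseconds, averaged over
   `iters` calls. Returns a negative value on bad arguments or if the
   buffers cannot be allocated. */
double time_memcpy_usecs(size_t size, long iters)
{
    if (size == 0 || iters <= 0)
        return -1.0;

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so page faults are not part of the timed loop. */
    memset(src, 0xA5, size);
    memset(dst, 0, size);

    /* Call through a volatile function pointer so the compiler cannot
       inline or drop the copy; the libc memcpy is really called. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        copy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);

    /* Read one byte of the result so the copies have an observable use. */
    volatile unsigned char sink = dst[size - 1];
    (void)sink;

    free(src);
    free(dst);
    return ns / 1e3 / (double)iters;
}
```

Calling `time_memcpy_usecs(10 * 1024, 100000)` from a small `main` would give a figure roughly comparable to the memcpy_10k row above. Small differences such as the 3% on memcpy_10m are easy to lose in run-to-run noise, so pinning the process to one core and repeating the run several times is advisable.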
The 07/25/2020 09:49, wangshuo (AF) wrote:
> this commit 4a733bf375238a6a595033b5785cea7f27d61307 adds optimized
> Q-register memcpy.

please add "aarch64" to the email subject if it's for aarch64 only.

that commit should not alter the kunpeng920 memcpy. did you change
the ifunc logic to use the new one?

the previous commit increases the entry alignment and this one
may move the memcpy to a slightly different location in libc.so,
but that's about it.

> However, I can not get an ideal results in my enviornment. This is my test:
>
> test suite: libMicro-0.4.0
>
> ./memcpy -E -C 200 -L -S -W -N "memcpy_10" -s 10 -I 10
> ./memcpy -E -C 200 -L -S -W -N "memcpy_1k" -s 1k -I 50
> ./memcpy -E -C 200 -L -S -W -N "memcpy_10k" -s 10k -I 800
> ./memcpy -E -C 200 -L -S -W -N "memcpy_1m" -s 1m -I 500000
> ./memcpy -E -C 200 -L -S -W -N "memcpy_10m" -s 10m -I 5000000
>
>
> hardware platform:
> Kunpeng-920 @ 2600.0000MHz
> L1d cache: 6 MiB
> L1i cache: 6 MiB
> L2 cache: 48 MiB
> L3 cache: 192 MiB
>
>              before this commit(usecs)   after this commit(usecs)
> memcpy_10      0.0065                      0.0065
> memcpy_1k      0.0299                      0.0294
> memcpy_10k     0.2642                      0.2642
> memcpy_1m     27.9040                     27.6480
> memcpy_10m   265.9840                    274.6880
> strlen_10      0.0039                      0.0039
> strlen_1k      0.0571                      0.0450
>
> I was wondering if you could give me some advices about my test results.

3% regression on large copies may be explained by uarch
implementation internals; you can verify that by keeping the
code the same and just adding some nop padding around memcpy.
In reply to this post by Shuo Wang
Hi Wangshuo,
As Szabolcs said, you didn't run the new GLIBC memcpy. The easiest way to compare memcpy implementations in GLIBC is to use "make bench". After that you can run the memcpy benchmarks directly, e.g.:

taskset -c 5 $BUILD_DIR/benchtests/bench-memcpy-random >out.txt

This produces results for the different memcpy implementations so you can see which works best. I think the new memcpy_simd will work better on Kunpeng than memcpy_falkor.

Cheers,
Wilco