[PATCH 1/14] [x86_64] Vector math functions (vector cos)

classic Classic list List threaded Threaded
35 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
Hi,

this patch starts series of patches with vectorized implementations of
several math functions for x86_64.

Here is implementation of cos containing SSE, AVX, AVX2 and AVX512
versions according to Vector ABI which was discussed in
https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4.

This patch also adds vector math library build and ABI check enabled
by default for x86_64.

2015-04-30  Andrew Senkevich  <[hidden email]>

        * sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
        * sysdeps/x86_64/fpu/Versions: New file.
        * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
        * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
        * sysdeps/x86_64/configure: Regenerated.
        * sysdeps/x86_64/configure.ac: Options for libmvec build.
        * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.
        * sysdeps/x86_64/fpu/Makefile: New file.
        * sysdeps/x86_64/fpu/multiarch/Makefile: Added build of SSE version
        which is IFUNC.

diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h
index e8e5577..85a6a95 100644
--- a/math/bits/mathcalls.h
+++ b/math/bits/mathcalls.h
@@ -60,7 +60,7 @@ __MATHCALL (atan,, (_Mdouble_ __x));
 __MATHCALL (atan2,, (_Mdouble_ __y, _Mdouble_ __x));

 /* Cosine of X.  */
-__MATHCALL (cos,, (_Mdouble_ __x));
+__MATHCALL_VEC (cos,, (_Mdouble_ __x));
 /* Sine of X.  */
 __MATHCALL (sin,, (_Mdouble_ __x));
 /* Tangent of X.  */
diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
new file mode 100644
index 0000000..be6eaed
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
@@ -0,0 +1,6 @@
+GLIBC_2.22
+ GLIBC_2.22 A
+ _ZGVbN2v_cos F
+ _ZGVcN4v_cos F
+ _ZGVdN4v_cos F
+ _ZGVeN8v_cos F
diff --git a/sysdeps/x86/fpu/bits/math-vector.h
b/sysdeps/x86/fpu/bits/math-vector.h
new file mode 100644
index 0000000..6540268
--- /dev/null
+++ b/sysdeps/x86/fpu/bits/math-vector.h
@@ -0,0 +1,34 @@
+/* Platform-specific SIMD declarations of math functions.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _MATH_H
+# error "Never include <bits/math-vector.h> directly;\
+ include <math.h> instead."
+#endif
+
+/* Get default empty definitions for simd declarations.  */
+#include <bits/libm-simd-decl-stubs.h>
+
+#if defined __x86_64__ && defined __FAST_MATH__
+# if defined _OPENMP && _OPENMP >= 201307
+/* OpenMP case.  */
+#  define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch")
+#  undef __DECL_SIMD_cos
+#  define __DECL_SIMD_cos __DECL_SIMD_x86_64
+# endif
+#endif
diff --git a/sysdeps/x86_64/configure b/sysdeps/x86_64/configure
index 7d4dadd..1493523 100644
--- a/sysdeps/x86_64/configure
+++ b/sysdeps/x86_64/configure
@@ -275,6 +275,10 @@ fi
 config_vars="$config_vars
 config-cflags-avx2 = $libc_cv_cc_avx2"

+if test x"$build_mathvec" = xnotset; then
+  build_mathvec=yes
+fi
+
 $as_echo "#define PI_STATIC_AND_HIDDEN 1" >>confdefs.h

 # work around problem with autoconf and empty lines at the end of files
diff --git a/sysdeps/x86_64/configure.ac b/sysdeps/x86_64/configure.ac
index c9f9a51..1c2b35f 100644
--- a/sysdeps/x86_64/configure.ac
+++ b/sysdeps/x86_64/configure.ac
@@ -99,6 +99,10 @@ if test $libc_cv_cc_avx2 = yes; then
 fi
 LIBC_CONFIG_VAR([config-cflags-avx2], [$libc_cv_cc_avx2])

+if test x"$build_mathvec" = xnotset; then
+  build_mathvec=yes
+fi
+
 dnl It is always possible to access static and hidden symbols in an
 dnl position independent way.
 AC_DEFINE(PI_STATIC_AND_HIDDEN)
diff --git a/sysdeps/x86_64/fpu/Makefile b/sysdeps/x86_64/fpu/Makefile
new file mode 100644
index 0000000..c24188f
--- /dev/null
+++ b/sysdeps/x86_64/fpu/Makefile
@@ -0,0 +1,5 @@
+ifeq ($(subdir),mathvec)
+libmvec-support += svml_d_cos2_core_sse svml_d_cos4_core_avx \
+                  svml_d_cos4_core_avx2 svml_d_cos8_core_avx512 \
+                  svml_d_cos_data init-arch
+endif
diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions
new file mode 100644
index 0000000..b38ed07
--- /dev/null
+++ b/sysdeps/x86_64/fpu/Versions
@@ -0,0 +1,8 @@
+libmvec {
+  GLIBC_2.22 {
+    _ZGVbN2v_cos;
+    _ZGVcN4v_cos;
+    _ZGVdN4v_cos;
+    _ZGVeN8v_cos;
+  }
+}
diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile
b/sysdeps/x86_64/fpu/multiarch/Makefile
index 12b0526..062d8ce 100644
--- a/sysdeps/x86_64/fpu/multiarch/Makefile
+++ b/sysdeps/x86_64/fpu/multiarch/Makefile
@@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
 CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
 endif
 endif
+
+ifeq ($(subdir),mathvec)
+libmvec-support += svml_d_cos2_core
+endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
new file mode 100644
index 0000000..29c6afd
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
@@ -0,0 +1,34 @@
+/* Multiple versions of vectorized cos.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+
+       .text
+ENTRY(_ZGVbN2v_cos)
+        .type   _ZGVbN2v_cos, @gnu_indirect_function
+        cmpl    $0, KIND_OFFSET+__cpu_features(%rip)
+        jne     1f
+        call    __init_cpu_features
+1:      leaq    _ZGVbN2v_cos_sse4(%rip), %rax
+        testl   $bit_SSE4_1, __cpu_features+CPUID_OFFSET+index_SSE4_1(%rip)
+        jz      2f
+        ret
+2:      leaq    _ZGVbN2v_cos_sse2(%rip), %rax
+        ret
+END(_ZGVbN2v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
new file mode 100644
index 0000000..16da39b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
@@ -0,0 +1,225 @@
+/* Function cos vectorized with SSE2 and SSE4.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+#include "svml_d_cos_data.h"
+
+       .text
+ENTRY(_ZGVbN2v_cos_sse2)
+/* SSE2 version as wrapper to scalar.  */
+        subq      $40, %rsp
+        movaps    %xmm0, (%rsp)
+        call      cos@PLT
+        movsd     %xmm0, 16(%rsp)
+        movsd     8(%rsp), %xmm0
+        call      cos@PLT
+        movsd     16(%rsp), %xmm1
+        movsd     %xmm0, 24(%rsp)
+        unpcklpd  %xmm0, %xmm1
+        movaps    %xmm1, %xmm0
+        addq      $40, %rsp
+        ret
+END(_ZGVbN2v_cos_sse2)
+
+ENTRY(_ZGVbN2v_cos_sse4)
+/* ALGORITHM DESCRIPTION:
+ *
+ *      ( low accuracy ( < 4ulp ) or enhanced performance ( half of correct
+ *      mantissa ) implementation )
+ *
+ *      Argument representation:
+ *      arg + Pi/2 = (N*Pi + R)
+ *
+ *      Result calculation:
+ *      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+ *      sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        movq      %rsp, %rbp
+        andq      $-64, %rsp
+        subq      $320, %rsp
+        movaps    %xmm0, %xmm3
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+        movups    __dHalfPI(%rax), %xmm2
+
+/* ARGUMENT RANGE REDUCTION:
+ * Add Pi/2 to argument: X' = X+Pi/2
+ */
+        addpd     %xmm3, %xmm2
+        movups    __dInvPI(%rax), %xmm5
+        movups    __dAbsMask(%rax), %xmm4
+
+/* Get absolute argument value: X' = |X'| */
+        andps     %xmm2, %xmm4
+
+/* Y = X'*InvPi + RS : right shifter add */
+        mulpd     %xmm5, %xmm2
+
+/* Check for large arguments path */
+        cmpnlepd  __dRangeVal(%rax), %xmm4
+        movups    __dRShifter(%rax), %xmm6
+        addpd     %xmm6, %xmm2
+        movmskpd  %xmm4, %ecx
+
+/* N = Y - RS : right shifter sub */
+        movaps    %xmm2, %xmm1
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        psllq     $63, %xmm2
+        subpd     %xmm6, %xmm1
+
+/* N = N - 0.5 */
+        subpd     __dOneHalf(%rax), %xmm1
+        movups    __dPI1(%rax), %xmm7
+
+/* R = X - N*Pi1 */
+        mulpd     %xmm1, %xmm7
+        movups    __dPI2(%rax), %xmm4
+
+/* R = R - N*Pi2 */
+        mulpd     %xmm1, %xmm4
+        subpd     %xmm7, %xmm0
+        movups    __dPI3(%rax), %xmm5
+
+/* R = R - N*Pi3 */
+        mulpd     %xmm1, %xmm5
+        subpd     %xmm4, %xmm0
+
+/* R = R - N*Pi4 */
+        movups     __dPI4(%rax), %xmm6
+        mulpd     %xmm6, %xmm1
+        subpd     %xmm5, %xmm0
+        subpd     %xmm1, %xmm0
+
+/* POLYNOMIAL APPROXIMATION:
+ * R2 = R*R
+ */
+        movaps    %xmm0, %xmm4
+        mulpd     %xmm0, %xmm4
+        movups    __dC7(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC6(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC5(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC4(%rax), %xmm1
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        mulpd     %xmm4, %xmm1
+        addpd     __dC3(%rax), %xmm1
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        mulpd     %xmm4, %xmm1
+        addpd     __dC2(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC1(%rax), %xmm1
+        mulpd     %xmm1, %xmm4
+        mulpd     %xmm0, %xmm4
+        addpd     %xmm4, %xmm0
+
+/* RECONSTRUCTION:
+ * Final sign setting: Res = Poly^SignRes
+ */
+        xorps     %xmm2, %xmm0
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        movq      %rbp, %rsp
+        popq      %rbp
+        ret
+
+.LBL_1_3:
+        movups    %xmm3, 192(%rsp)
+        movups    %xmm0, 256(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        movups    %xmm8, 112(%rsp)
+        movups    %xmm9, 96(%rsp)
+        movups    %xmm10, 80(%rsp)
+        movups    %xmm11, 64(%rsp)
+        movups    %xmm12, 48(%rsp)
+        movups    %xmm13, 32(%rsp)
+        movups    %xmm14, 16(%rsp)
+        movups    %xmm15, (%rsp)
+        movq      %rsi, 136(%rsp)
+        movq      %rdi, 128(%rsp)
+        movq      %r12, 168(%rsp)
+        movb      %dl, %r12b
+        movq      %r13, 160(%rsp)
+        movl      %ecx, %r13d
+        movq      %r14, 152(%rsp)
+        movl      %eax, %r14d
+        movq      %r15, 144(%rsp)
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        movups    112(%rsp), %xmm8
+        movups    96(%rsp), %xmm9
+        movups    80(%rsp), %xmm10
+        movups    64(%rsp), %xmm11
+        movups    48(%rsp), %xmm12
+        movups    32(%rsp), %xmm13
+        movups    16(%rsp), %xmm14
+        movups    (%rsp), %xmm15
+        movq      136(%rsp), %rsi
+        movq      128(%rsp), %rdi
+        movq      168(%rsp), %r12
+        movq      160(%rsp), %r13
+        movq      152(%rsp), %r14
+        movq      144(%rsp), %r15
+        movups    256(%rsp), %xmm0
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        movsd     200(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        movsd     %xmm0, 264(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        movsd     192(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        movsd     %xmm0, 256(%rsp,%r15)
+        jmp       .LBL_1_7
+
+END(_ZGVbN2v_cos_sse4)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
new file mode 100644
index 0000000..5c24f42
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
@@ -0,0 +1,39 @@
+/* Function cos vectorized in AVX ISA as wrapper to SSE4 ISA version.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+       .text
+ENTRY(_ZGVcN4v_cos)
+        pushq          %rbp
+        movq           %rsp, %rbp
+        andq           $-32, %rsp
+        subq           $32, %rsp
+        vextractf128   $1, %ymm0, (%rsp)
+        vzeroupper
+        call           _ZGVbN2v_cos@PLT
+        vmovapd                %xmm0, 16(%rsp)
+        vmovaps                (%rsp), %xmm0
+        call           _ZGVbN2v_cos@PLT
+        vmovapd                %xmm0, %xmm1
+        vmovapd                16(%rsp), %xmm0
+        vinsertf128    $1, %xmm1, %ymm0, %ymm0
+        movq           %rbp, %rsp
+        popq           %rbp
+        ret
+END(_ZGVcN4v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
new file mode 100644
index 0000000..28f661c
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
@@ -0,0 +1,193 @@
+/* Function cos vectorized with AVX2.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+
+       .text
+ENTRY(_ZGVdN4v_cos)
+
+/* ALGORITHM DESCRIPTION:
+ *
+ *    ( low accuracy ( < 4ulp ) or enhanced performance
+ *      ( half of correct mantissa ) implementation )
+ *
+ *    Argument representation:
+ *    arg + Pi/2 = (N*Pi + R)
+ *
+ *    Result calculation:
+ *    cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+ *    sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        movq      %rsp, %rbp
+        andq      $-64, %rsp
+        subq      $448, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+        vmovapd   %ymm0, %ymm1
+        vmovupd __dInvPI(%rax), %ymm4
+        vmovupd __dRShifter(%rax), %ymm5
+
+/*
+ * ARGUMENT RANGE REDUCTION:
+ * Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %ymm1, %ymm7
+
+/* Get absolute argument value: X' = |X'| */
+        vandpd __dAbsMask(%rax), %ymm7, %ymm2
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd %ymm5, %ymm4, %ymm7
+        vmovupd __dC7(%rax), %ymm4
+
+/* Check for large arguments path */
+        vcmpnle_uqpd __dRangeVal(%rax), %ymm2, %ymm3
+
+/* N = Y - RS : right shifter sub */
+        vsubpd    %ymm5, %ymm7, %ymm6
+        vmovupd __dPI1_FMA(%rax), %ymm2
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %ymm7, %ymm7
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %ymm6, %ymm0
+        vmovmskpd %ymm3, %ecx
+
+/* R = X - N*Pi1 */
+        vmovapd   %ymm1, %ymm3
+        vfnmadd231pd %ymm0, %ymm2, %ymm3
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %ymm0, %ymm3
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %ymm3, %ymm0
+
+/*
+ * POLYNOMIAL APPROXIMATION:
+ * R2 = R*R
+ */
+        vmulpd    %ymm0, %ymm0, %ymm5
+        vfmadd213pd __dC6(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC5(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC4(%rax), %ymm5, %ymm4
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %ymm5, %ymm4
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC1(%rax), %ymm5, %ymm4
+        vmulpd    %ymm5, %ymm4, %ymm6
+        vfmadd213pd %ymm0, %ymm0, %ymm6
+
+/*
+ * RECONSTRUCTION:
+ * Final sign setting: Res = Poly^SignRes
+ */
+        vxorpd    %ymm7, %ymm6, %ymm0
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        movq      %rbp, %rsp
+        popq      %rbp
+        ret
+
+.LBL_1_3:
+        vmovupd   %ymm1, 320(%rsp)
+        vmovupd   %ymm0, 384(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        vmovups   %ymm8, 224(%rsp)
+        vmovups   %ymm9, 192(%rsp)
+        vmovups   %ymm10, 160(%rsp)
+        vmovups   %ymm11, 128(%rsp)
+        vmovups   %ymm12, 96(%rsp)
+        vmovups   %ymm13, 64(%rsp)
+        vmovups   %ymm14, 32(%rsp)
+        vmovups   %ymm15, (%rsp)
+        movq      %rsi, 264(%rsp)
+        movq      %rdi, 256(%rsp)
+        movq      %r12, 296(%rsp)
+        movb      %dl, %r12b
+        movq      %r13, 288(%rsp)
+        movl      %ecx, %r13d
+        movq      %r14, 280(%rsp)
+        movl      %eax, %r14d
+        movq      %r15, 272(%rsp)
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        vmovups   224(%rsp), %ymm8
+        vmovups   192(%rsp), %ymm9
+        vmovups   160(%rsp), %ymm10
+        vmovups   128(%rsp), %ymm11
+        vmovups   96(%rsp), %ymm12
+        vmovups   64(%rsp), %ymm13
+        vmovups   32(%rsp), %ymm14
+        vmovups   (%rsp), %ymm15
+        vmovupd   384(%rsp), %ymm0
+        movq      264(%rsp), %rsi
+        movq      256(%rsp), %rdi
+        movq      296(%rsp), %r12
+        movq      288(%rsp), %r13
+        movq      280(%rsp), %r14
+        movq      272(%rsp), %r15
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    328(%rsp,%r15), %xmm0
+        vzeroupper
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 392(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    320(%rsp,%r15), %xmm0
+        vzeroupper
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 384(%rsp,%r15)
+        jmp       .LBL_1_7
+
+END(_ZGVdN4v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
new file mode 100644
index 0000000..cb65750
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
@@ -0,0 +1,247 @@
+/* Function cos vectorized with AVX-512, KNL version.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+
+       .text
+ENTRY(_ZGVeN8v_cos)
+#ifndef HAVE_AVX512_ASM_SUPPORT
+/* Implemented as wrapper to AVX2 version.  */
+        pushq  %rbp
+        movq   %rsp, %rbp
+        andq   $-64, %rsp
+        subq   $64, %rsp
+/* Below is encoding for vmovaps %zmm0, (%rsp).  */
+        .byte  0x62
+        .byte  0xf1
+        .byte  0x7c
+        .byte  0x48
+        .byte  0x29
+        .byte  0x04
+        .byte  0x24
+/* Below is encoding for vmovapd (%rsp), %ymm0.  */
+        .byte  0xc5
+        .byte  0xfd
+        .byte  0x28
+        .byte  0x04
+        .byte  0x24
+        call   _ZGVdN4v_cos@PLT
+/* Below is encoding for vmovapd 32(%rsp), %ymm0.  */
+        .byte  0xc5
+        .byte  0xfd
+        .byte  0x28
+        .byte  0x44
+        .byte  0x24
+        .byte  0x20
+        call   _ZGVdN4v_cos@PLT
+        movq   %rbp, %rsp
+        popq   %rbp
+        ret
+#else
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *      ( low accuracy ( < 4ulp ) or
+ *        enhanced performance ( half of correct mantissa ) implementation )
+ *
+ *      Argument representation:
+ *      arg + Pi/2 = (N*Pi + R)
+ *
+ *      Result calculation:
+ *      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+ *      sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        movq      %rsp, %rbp
+        andq      $-64, %rsp
+        subq      $1280, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+
+/* R = X - N*Pi1 */
+        vmovaps   %zmm0, %zmm7
+
+/* Check for large arguments path */
+        movq      $-1, %rcx
+
+/*
+ * ARGUMENT RANGE REDUCTION:
+ * Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %zmm0, %zmm5
+        vmovups __dInvPI(%rax), %zmm3
+
+/* Get absolute argument value: X' = |X'| */
+        vpandq __dAbsMask(%rax), %zmm5, %zmm1
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd __dRShifter(%rax), %zmm3, %zmm5
+        vmovups __dPI1_FMA(%rax), %zmm6
+
+/* N = Y - RS : right shifter sub */
+        vsubpd __dRShifter(%rax), %zmm5, %zmm4
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %zmm5, %zmm12
+        vmovups __dC7(%rax), %zmm8
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %zmm4, %zmm10
+        vcmppd    $22, __dRangeVal(%rax), %zmm1, %k1
+        vpbroadcastq %rcx, %zmm2{%k1}{z}
+        vfnmadd231pd %zmm10, %zmm6, %zmm7
+        vptestmq  %zmm2, %zmm2, %k0
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %zmm10, %zmm7
+        kmovw     %k0, %ecx
+        movzbl    %cl, %ecx
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %zmm7, %zmm10
+
+/*
+ * POLYNOMIAL APPROXIMATION:
+ * R2 = R*R
+ */
+        vmulpd    %zmm10, %zmm10, %zmm9
+        vfmadd213pd __dC6(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC5(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC4(%rax), %zmm9, %zmm8
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %zmm9, %zmm8
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC1(%rax), %zmm9, %zmm8
+        vmulpd    %zmm9, %zmm8, %zmm11
+        vfmadd213pd %zmm10, %zmm10, %zmm11
+
+/*
+ * RECONSTRUCTION:
+ * Final sign setting: Res = Poly^SignRes
+ */
+        vpxorq    %zmm12, %zmm11, %zmm1
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        vmovaps   %zmm1, %zmm0
+        movq      %rbp, %rsp
+        popq      %rbp
+        ret
+
+.LBL_1_3:
+        vmovups   %zmm0, 1152(%rsp)
+        vmovups   %zmm1, 1216(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        kmovw     %k4, 1048(%rsp)
+        xorl      %eax, %eax
+        kmovw     %k5, 1040(%rsp)
+        kmovw     %k6, 1032(%rsp)
+        kmovw     %k7, 1024(%rsp)
+        vmovups   %zmm16, 960(%rsp)
+        vmovups   %zmm17, 896(%rsp)
+        vmovups   %zmm18, 832(%rsp)
+        vmovups   %zmm19, 768(%rsp)
+        vmovups   %zmm20, 704(%rsp)
+        vmovups   %zmm21, 640(%rsp)
+        vmovups   %zmm22, 576(%rsp)
+        vmovups   %zmm23, 512(%rsp)
+        vmovups   %zmm24, 448(%rsp)
+        vmovups   %zmm25, 384(%rsp)
+        vmovups   %zmm26, 320(%rsp)
+        vmovups   %zmm27, 256(%rsp)
+        vmovups   %zmm28, 192(%rsp)
+        vmovups   %zmm29, 128(%rsp)
+        vmovups   %zmm30, 64(%rsp)
+        vmovups   %zmm31, (%rsp)
+        movq      %rsi, 1064(%rsp)
+        movq      %rdi, 1056(%rsp)
+        movq      %r12, 1096(%rsp)
+        movb      %dl, %r12b
+        movq      %r13, 1088(%rsp)
+        movl      %ecx, %r13d
+        movq      %r14, 1080(%rsp)
+        movl      %eax, %r14d
+        movq      %r15, 1072(%rsp)
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        addb      $1, %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        kmovw     1048(%rsp), %k4
+        movq      1064(%rsp), %rsi
+        kmovw     1040(%rsp), %k5
+        movq      1056(%rsp), %rdi
+        kmovw     1032(%rsp), %k6
+        movq      1096(%rsp), %r12
+        movq      1088(%rsp), %r13
+        kmovw     1024(%rsp), %k7
+        vmovups   960(%rsp), %zmm16
+        vmovups   896(%rsp), %zmm17
+        vmovups   832(%rsp), %zmm18
+        vmovups   768(%rsp), %zmm19
+        vmovups   704(%rsp), %zmm20
+        vmovups   640(%rsp), %zmm21
+        vmovups   576(%rsp), %zmm22
+        vmovups   512(%rsp), %zmm23
+        vmovups   448(%rsp), %zmm24
+        vmovups   384(%rsp), %zmm25
+        vmovups   320(%rsp), %zmm26
+        vmovups   256(%rsp), %zmm27
+        vmovups   192(%rsp), %zmm28
+        vmovups   128(%rsp), %zmm29
+        vmovups   64(%rsp), %zmm30
+        vmovups   (%rsp), %zmm31
+        movq      1080(%rsp), %r14
+        movq      1072(%rsp), %r15
+        vmovups   1216(%rsp), %zmm1
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1160(%rsp,%r15), %xmm0
+        call      cos@PLT
+        vmovsd    %xmm0, 1224(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1152(%rsp,%r15), %xmm0
+        call      cos@PLT
+        vmovsd    %xmm0, 1216(%rsp,%r15)
+        jmp       .LBL_1_7
+#endif
+END(_ZGVeN8v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.S
b/sysdeps/x86_64/fpu/svml_d_cos_data.S
new file mode 100644
index 0000000..6f36ecd
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos_data.S
@@ -0,0 +1,114 @@
+/* Data for vectorized cos.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "svml_d_cos_data.h"
+
+.macro double_vector offset value
+.if .-__svml_dcos_data != \offset
+.err
+.endif
+.rept 8
+.quad \value
+.endr
+.endm
+
+       .section .rodata, "a"
+       .align 64
+
+/* Data table for vector implementations of function cos.
+ * The table may contain polynomial, reduction, lookup
+ * coefficients and other constants obtained through different
+ * methods of research and experimental work.
+ */
+       .globl __svml_dcos_data
+__svml_dcos_data:
+
+/* General purpose constants:
+ * absolute value mask
+ */
+double_vector __dAbsMask 0x7fffffffffffffff
+
+/* working range threshold */
+double_vector __dRangeVal 0x4160000000000000
+
+/* PI/2 */
+double_vector __dHalfPI 0x3ff921fb54442d18
+
+/* 1/PI */
+double_vector __dInvPI 0x3fd45f306dc9c883
+
+/* right-shifter constant */
+double_vector __dRShifter 0x4338000000000000
+
+/* 0.5 */
+double_vector __dOneHalf 0x3fe0000000000000
+
+/* Range reduction PI-based constants:
+ * PI high part
+ */
+double_vector __dPI1 0x400921fb40000000
+
+/* PI mid  part 1 */
+double_vector __dPI2 0x3e84442d00000000
+
+/* PI mid  part 2 */
+double_vector __dPI3 0x3d08469880000000
+
+/* PI low  part */
+double_vector __dPI4 0x3b88cc51701b839a
+
+/* Range reduction PI-based constants if FMA available:
+ * PI high part (FMA available)
+ */
+double_vector __dPI1_FMA 0x400921fb54442d18
+
+/* PI mid part  (FMA available) */
+double_vector __dPI2_FMA 0x3ca1a62633145c06
+
+/* PI low part  (FMA available) */
+double_vector __dPI3_FMA 0x395c1cd129024e09
+
+/* Polynomial coefficients (relative error 2^(-52.115)): */
+double_vector __dC1 0xbfc55555555554a7
+double_vector __dC2 0x3f8111111110a4a8
+double_vector __dC3 0xbf2a01a019a5b86d
+double_vector __dC4 0x3ec71de38030fea0
+double_vector __dC5 0xbe5ae63546002231
+double_vector __dC6 0x3de60e6857a2f220
+double_vector __dC7 0xbd69f0d60811aac8
+
+/*
+ * Additional constants:
+ * absolute value mask
+ */
+double_vector __dAbsMask_la 0x7fffffffffffffff
+
+/* 1/PI */
+double_vector __dInvPI_la 0x3fd45f306dc9c883
+
+/* right-shifer for low accuracy version */
+double_vector __dRShifter_la 0x4330000000000000
+
+/* right-shifer-1.0 for low accuracy version */
+double_vector __dRShifterm5_la 0x432fffffffffffff
+
+/* right-shifer with low mask for low accuracy version */
+double_vector __dRXmax_la 0x43300000007ffffe
+
+       .type   __svml_dcos_data,@object
+       .size   __svml_dcos_data,.-__svml_dcos_data
diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.h
b/sysdeps/x86_64/fpu/svml_d_cos_data.h
new file mode 100644
index 0000000..21fa806
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos_data.h
@@ -0,0 +1,48 @@
+/* Offsets for data table for vectorized cos.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef D_COS_DATA_H
+#define D_COS_DATA_H
+
+#define __dAbsMask              0
+#define __dRangeVal             64
+#define __dHalfPI               128
+#define __dInvPI                192
+#define __dRShifter             256
+#define __dOneHalf              320
+#define __dPI1                  384
+#define __dPI2                  448
+#define __dPI3                  512
+#define __dPI4                  576
+#define __dPI1_FMA              640
+#define __dPI2_FMA              704
+#define __dPI3_FMA              768
+#define __dC1                   832
+#define __dC2                   896
+#define __dC3                   960
+#define __dC4                   1024
+#define __dC5                   1088
+#define __dC6                   1152
+#define __dC7                   1216
+#define __dAbsMask_la           1280
+#define __dInvPI_la             1344
+#define __dRShifter_la          1408
+#define __dRShifterm5_la        1472
+#define __dRXmax_la             1536
+
+#endif

Is this patch ok?


--
WBR,
Andrew

0001-x86-64-vector-cos.patch (44K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Joseph Myers
On Thu, 30 Apr 2015, Andrew Senkevich wrote:

> +   Copyright (C) 2014 Free Software Foundation, Inc.

2014-2015 in all new files.

> +ENTRY(_ZGVbN2v_cos_sse4)
> +/* ALGORITHM DESCRIPTION:
> + *
> + *      ( low accuracy ( < 4ulp ) or enhanced performance ( half of correct
> + *      mantissa ) implementation )
> + *
> + *      Argument representation:
> + *      arg + Pi/2 = (N*Pi + R)
> + *
> + *      Result calculation:
> + *      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
> + *      sin(R) is approximated by corresponding polynomial
> + */
> +        pushq     %rbp
> +        movq      %rsp, %rbp
> +        andq      $-64, %rsp
> +        subq      $320, %rsp

Stack adjustments should have CFI that accurately describes how to unwind
the stack on any instruction boundary (try stepping by instruction in GDB
and making sure you can get an accurate backtrace after every step).  
Probably something like

pushq %rbp
cfi_adjust_cfa_offset (8)
cfi_rel_offset (%rbp, 0)
movq %rsp, %rbp
cfi_def_cfa_register (%rbp)

(having defined %rbp to be the CFA, after that %rsp adjustments don't need
further CFI)

...

movq %rbp, %rsp
cfi_def_cfa_register (%rsp)
popq %rbp
cfi_adjust_cfa_offset (-8)
cfi_restore (%rbp)

(but since you jump past code like that, you need to save and restore CFI
state appropriately for that).

If you save, clobber and restore any call-preserved registers, CFI is also
needed for that.  Again, you need to take care about jumps to ensure the
CFI state is correct at each label that may be a jump target.

> +ENTRY(_ZGVcN4v_cos)
> +        pushq          %rbp
> +        movq           %rsp, %rbp
> +        andq           $-32, %rsp
> +        subq           $32, %rsp
> +        vextractf128   $1, %ymm0, (%rsp)
> +        vzeroupper
> +        call           _ZGVbN2v_cos@PLT
> +        vmovapd                %xmm0, 16(%rsp)
> +        vmovaps                (%rsp), %xmm0
> +        call           _ZGVbN2v_cos@PLT

Does this pass testing (specifically, the localplt test in the elf/
directory)?  I'd expect you to need to use HIDDEN_JUMPTARGET when calling
any symbol exported from libmvec (together with libm_hidden_def /
libm_hidden_weak on the definition of the function called, if not already
present).

If it passes despite there being unnecessary local PLT references in
libmvec, that points to something missing from the earlier patches:
libmvec should be included in the libraries for which the localplt test
runs.

> +/* Data table for vector implementations of function cos.
> + * The table may contain polynomial, reduction, lookup
> + * coefficients and other constants obtained through different
> + * methods of research and experimental work.

Comments in glibc don't use an initial '*' on each line after the first.

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
Hi Joseph,

here is fixed patch with some additions (wrapper implementations goes
to separate header and AVX512 version now has 2 implementations).


--
WBR,
Andrew

0001-x86-64-vector-cos.patch (59K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Joseph Myers
On Fri, 15 May 2015, Andrew Senkevich wrote:

> diff --git a/include/libc-symbols.h b/include/libc-symbols.h
> index ca3fe00..8992e9e 100644
> --- a/include/libc-symbols.h
> +++ b/include/libc-symbols.h
> @@ -526,7 +526,7 @@ for linking")
>  # define rtld_hidden_data_ver(local, name)
>  #endif
>  
> -#if IS_IN (libm)
> +#if IS_IN (libm) || IS_IN(libmvec)
>  # define libm_hidden_proto(name, attrs...) hidden_proto (name, ##attrs)
>  # define libm_hidden_tls_proto(name, attrs...) hidden_tls_proto (name, ##attrs)
>  # define libm_hidden_def(name) hidden_def (name)

I think it would be better to have a completely separate set of
libmvec_hidden_* definitions (sorry for misleading you in my previous
comments by referring to libm_hidden_*).  Also remember the space before
'(' in the IS_IN call, and in other macro calls such as ENTRY, END and
libmvec_hidden_def.

> +        jne       .LBL_1_3
> +
> +.LBL_1_2:
> +        movq      %rbp, %rsp
> +        cfi_def_cfa_register (%rsp)
> +        popq      %rbp
> +        cfi_adjust_cfa_offset (-8)
> +        cfi_restore (%rbp)
> +        ret
> +
> +.LBL_1_3:
> +        movups    %xmm3, 192(%rsp)

This won't yield correct CFI at .LBL_1_3 and afterwards.  At .LBL_1_3, the
correct CFI is the same as that at .LBL_1_2 - *not* the same as that at
the ret instruction.  So you should do cfi_remember_state before the CFI
changes here, and cfi_restore_state after ret, for the CFI to be correct
for the rest of the function.

> +        movups    %xmm0, 256(%rsp)
> +        je        .LBL_1_2
> +
> +        xorb      %dl, %dl
> +        xorl      %eax, %eax
> +        movups    %xmm8, 112(%rsp)
> +        movups    %xmm9, 96(%rsp)
> +        movups    %xmm10, 80(%rsp)
> +        movups    %xmm11, 64(%rsp)
> +        movups    %xmm12, 48(%rsp)
> +        movups    %xmm13, 32(%rsp)
> +        movups    %xmm14, 16(%rsp)
> +        movups    %xmm15, (%rsp)
> +        movq      %rsi, 136(%rsp)
> +        movq      %rdi, 128(%rsp)
> +        movq      %r12, 168(%rsp)
> +        movb      %dl, %r12b
> +        movq      %r13, 160(%rsp)
> +        movl      %ecx, %r13d
> +        movq      %r14, 152(%rsp)
> +        movl      %eax, %r14d
> +        movq      %r15, 144(%rsp)

The ABI says that r12-r15 are callee-saved registers.  That means you need
CFI describing when they are saved and restored (and again need
cfi_remember_state and cfi_restore_state to deal with how some code that's
physically after the restoration is logically before it, and so needs the
CFI describing how those registers are saved on the stack).

Similar issues apply with the CFI in other function implementations here.

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Szabolcs Nagy-2
In reply to this post by Andrew Senkevich
On 15/05/15 16:34, Andrew Senkevich wrote:

> diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h
> new file mode 100644
> index 0000000..27294ce
> --- /dev/null
> +++ b/sysdeps/x86/fpu/bits/math-vector.h
> @@ -0,0 +1,34 @@
> +/* Platform-specific SIMD declarations of math functions.
> +   Copyright (C) 2014-2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _MATH_H
> +# error "Never include <bits/math-vector.h> directly;\
> + include <math.h> instead."
> +#endif
> +
> +/* Get default empty definitions for simd declarations.  */
> +#include <bits/libm-simd-decl-stubs.h>
> +
> +#if defined __x86_64__ && defined __FAST_MATH__
> +# if defined _OPENMP && _OPENMP >= 201307
> +/* OpenMP case.  */
> +#  define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch")
> +#  undef __DECL_SIMD_cos
> +#  define __DECL_SIMD_cos __DECL_SIMD_x86_64
> +# endif
> +#endif

why does math.h have #ifdef _OPENMP ? (and __FAST_MATH__)

can this be implemented in an openmp runtime instead?

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Joseph Myers
On Fri, 15 May 2015, Szabolcs Nagy wrote:

> > +#if defined __x86_64__ && defined __FAST_MATH__
> > +# if defined _OPENMP && _OPENMP >= 201307
> > +/* OpenMP case.  */
> > +#  define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch")
> > +#  undef __DECL_SIMD_cos
> > +#  define __DECL_SIMD_cos __DECL_SIMD_x86_64
> > +# endif
> > +#endif
>
> why does math.h have #ifdef _OPENMP ? (and __FAST_MATH__)

Because this pragma to declare that the vector functions are available
(for the compiler to vectorize calls to scalar functions) is an OpenMP
pragma, only available when OpenMP is enabled.  They might also be made
available in other cases in future when appropriate pragmas (e.g.
CilkPlus) are available.  __FAST_MATH__ is because these functions may not
meet the normal glibc rules regarding accuracy of results and exceptions.

See the discussions starting last September of the approach for adding
vector libm functions.

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
In reply to this post by Joseph Myers
2015-05-15 19:32 GMT+03:00 Joseph Myers <[hidden email]>:

> On Fri, 15 May 2015, Andrew Senkevich wrote:
>
>> +        movups    %xmm0, 256(%rsp)
>> +        je        .LBL_1_2
>> +
>> +        xorb      %dl, %dl
>> +        xorl      %eax, %eax
>> +        movups    %xmm8, 112(%rsp)
>> +        movups    %xmm9, 96(%rsp)
>> +        movups    %xmm10, 80(%rsp)
>> +        movups    %xmm11, 64(%rsp)
>> +        movups    %xmm12, 48(%rsp)
>> +        movups    %xmm13, 32(%rsp)
>> +        movups    %xmm14, 16(%rsp)
>> +        movups    %xmm15, (%rsp)
>> +        movq      %rsi, 136(%rsp)
>> +        movq      %rdi, 128(%rsp)
>> +        movq      %r12, 168(%rsp)
>> +        movb      %dl, %r12b
>> +        movq      %r13, 160(%rsp)
>> +        movl      %ecx, %r13d
>> +        movq      %r14, 152(%rsp)
>> +        movl      %eax, %r14d
>> +        movq      %r15, 144(%rsp)
>
> The ABI says that r12-r15 are callee-saved registers.  That means you need
> CFI describing when they are saved and restored (and again need
> cfi_remember_state and cfi_restore_state to deal with how some code that's
> physically after the restoration is logically before it, and so needs the
> CFI describing how those registers are saved on the stack).

How to calculate offset for .cfi_offset directive after f.e. movq
%r12, 168(%rsp)?

If not to have %rbp as CFA but stay %rsp, how to specify the proper
CFI after         andq      $-64, %rsp?


--
WBR,
Andrew
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Joseph Myers
On Mon, 18 May 2015, Andrew Senkevich wrote:

> How to calculate offset for .cfi_offset directive after f.e. movq
> %r12, 168(%rsp)?

I think you'll need to use .cfi_escape in order to generate
DW_CFA_expression describing the register as saved relative to %rsp
instead of relative to the CFA.  Something like:

DW_CFA_expression
uleb128 register number (as in the x86_64 ABI)
uleb128 length of what follows
DW_OP_drop
DW_OP_breg7 (%rsp is register 7)
sleb128 offset from %rsp

appropriately encoded by hand and entered with .cfi_escape (with
appropriate comments saying what operations are being encoded).  You could
probably write an assembler macro to do the encoding (though if the
offsets in question don't all have the same length when encoded as
sleb128, that complicates the macro a bit).

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
> appropriate comments saying what operations are being encoded).  You could
> probably write an assembler macro to do the encoding (though if the
> offsets in question don't all have the same length when encoded as
> sleb128, that complicates the macro a bit).

Where to put that macro, in sysdeps/x86_64/sysdep.h?


--
WBR,
Andrew
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Joseph Myers
On Tue, 19 May 2015, Andrew Senkevich wrote:

> > appropriate comments saying what operations are being encoded).  You could
> > probably write an assembler macro to do the encoding (though if the
> > offsets in question don't all have the same length when encoded as
> > sleb128, that complicates the macro a bit).
>
> Where to put that macro, in sysdeps/x86_64/sysdep.h?

That might be plausible, depending on what the macro looks like (whether
it's something generic for any x86_64 code saving a register on the stack
when CFA is something else, or more specific to this libmvec code).

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
Here is fixed patch and ChangeLog.

2015-05-20  Andrew Senkevich  <[hidden email]>

        * sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
        * sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: New file.
        * sysdeps/x86_64/fpu/Makefile: New file.
        * sysdeps/x86_64/fpu/Versions: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-support): Added build
        of SSE and AVX512 versions which are IFUNC.
        * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
        * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
        * sysdeps/x86_64/configure.ac: Options for libmvec build.
        * sysdeps/x86_64/configure: Regenerated.
        * sysdeps/x86_64/sysdep.h (cfi_offset_rel_rsp): New macro.
        * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.
        * elf/Makefile (localplt-built-dso): libmvec added to localplt testing.
        * include/libc-symbols.h: Added libmvec_hidden_* macro series.

diff --git a/elf/Makefile b/elf/Makefile
index 34450ea..b06e0a7 100644
--- a/elf/Makefile
+++ b/elf/Makefile
@@ -990,6 +990,9 @@ localplt-built-dso := $(addprefix $(common-objpfx),\
   resolv/libresolv.so \
   crypt/libcrypt.so \
        )
+ifeq ($(build-mathvec),yes)
+localplt-built-dso += $(addprefix $(common-objpfx), mathvec/libmvec.so)
+endif
 ifeq ($(have-thread-library),yes)
 localplt-built-dso += $(filter-out %_nonshared.a, $(shared-thread-library))
 endif
diff --git a/include/libc-symbols.h b/include/libc-symbols.h
index ca3fe00..743b6f6 100644
--- a/include/libc-symbols.h
+++ b/include/libc-symbols.h
@@ -546,6 +546,26 @@ for linking")
 # define libm_hidden_data_ver(local, name)
 #endif

+#if IS_IN (libmvec)
+# define libmvec_hidden_proto(name, attrs...) hidden_proto (name, ##attrs)
+# define libmvec_hidden_tls_proto(name, attrs...) hidden_tls_proto
(name, ##attrs)
+# define libmvec_hidden_def(name) hidden_def (name)
+# define libmvec_hidden_weak(name) hidden_weak (name)
+# define libmvec_hidden_ver(local, name) hidden_ver (local, name)
+# define libmvec_hidden_data_def(name) hidden_data_def (name)
+# define libmvec_hidden_data_weak(name) hidden_data_weak (name)
+# define libmvec_hidden_data_ver(local, name) hidden_data_ver (local, name)
+#else
+# define libmvec_hidden_proto(name, attrs...)
+# define libmvec_hidden_tls_proto(name, attrs...)
+# define libmvec_hidden_def(name)
+# define libmvec_hidden_weak(name)
+# define libmvec_hidden_ver(local, name)
+# define libmvec_hidden_data_def(name)
+# define libmvec_hidden_data_weak(name)
+# define libmvec_hidden_data_ver(local, name)
+#endif
+
 #if IS_IN (libresolv)
 # define libresolv_hidden_proto(name, attrs...) hidden_proto (name, ##attrs)
 # define libresolv_hidden_tls_proto(name, attrs...) \
diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h
index e8e5577..85a6a95 100644
--- a/math/bits/mathcalls.h
+++ b/math/bits/mathcalls.h
@@ -60,7 +60,7 @@ __MATHCALL (atan,, (_Mdouble_ __x));
 __MATHCALL (atan2,, (_Mdouble_ __y, _Mdouble_ __x));

 /* Cosine of X.  */
-__MATHCALL (cos,, (_Mdouble_ __x));
+__MATHCALL_VEC (cos,, (_Mdouble_ __x));
 /* Sine of X.  */
 __MATHCALL (sin,, (_Mdouble_ __x));
 /* Tangent of X.  */
diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
new file mode 100644
index 0000000..be6eaed
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist
@@ -0,0 +1,6 @@
+GLIBC_2.22
+ GLIBC_2.22 A
+ _ZGVbN2v_cos F
+ _ZGVcN4v_cos F
+ _ZGVdN4v_cos F
+ _ZGVeN8v_cos F
diff --git a/sysdeps/x86/fpu/bits/math-vector.h
b/sysdeps/x86/fpu/bits/math-vector.h
new file mode 100644
index 0000000..27294ce
--- /dev/null
+++ b/sysdeps/x86/fpu/bits/math-vector.h
@@ -0,0 +1,34 @@
+/* Platform-specific SIMD declarations of math functions.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _MATH_H
+# error "Never include <bits/math-vector.h> directly;\
+ include <math.h> instead."
+#endif
+
+/* Get default empty definitions for simd declarations.  */
+#include <bits/libm-simd-decl-stubs.h>
+
+#if defined __x86_64__ && defined __FAST_MATH__
+# if defined _OPENMP && _OPENMP >= 201307
+/* OpenMP case.  */
+#  define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch")
+#  undef __DECL_SIMD_cos
+#  define __DECL_SIMD_cos __DECL_SIMD_x86_64
+# endif
+#endif
diff --git a/sysdeps/x86_64/configure b/sysdeps/x86_64/configure
index 7d4dadd..1493523 100644
--- a/sysdeps/x86_64/configure
+++ b/sysdeps/x86_64/configure
@@ -275,6 +275,10 @@ fi
 config_vars="$config_vars
 config-cflags-avx2 = $libc_cv_cc_avx2"

+if test x"$build_mathvec" = xnotset; then
+  build_mathvec=yes
+fi
+
 $as_echo "#define PI_STATIC_AND_HIDDEN 1" >>confdefs.h

 # work around problem with autoconf and empty lines at the end of files
diff --git a/sysdeps/x86_64/configure.ac b/sysdeps/x86_64/configure.ac
index c9f9a51..1c2b35f 100644
--- a/sysdeps/x86_64/configure.ac
+++ b/sysdeps/x86_64/configure.ac
@@ -99,6 +99,10 @@ if test $libc_cv_cc_avx2 = yes; then
 fi
 LIBC_CONFIG_VAR([config-cflags-avx2], [$libc_cv_cc_avx2])

+if test x"$build_mathvec" = xnotset; then
+  build_mathvec=yes
+fi
+
 dnl It is always possible to access static and hidden symbols in an
 dnl position independent way.
 AC_DEFINE(PI_STATIC_AND_HIDDEN)
diff --git a/sysdeps/x86_64/fpu/Makefile b/sysdeps/x86_64/fpu/Makefile
new file mode 100644
index 0000000..9cbf68b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/Makefile
@@ -0,0 +1,5 @@
+ifeq ($(subdir),mathvec)
+libmvec-support += svml_d_cos2_core_sse svml_d_cos4_core_avx \
+   svml_d_cos4_core_avx2 svml_d_cos8_core_avx512 \
+   svml_d_cos_data init-arch
+endif
diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions
new file mode 100644
index 0000000..b38ed07
--- /dev/null
+++ b/sysdeps/x86_64/fpu/Versions
@@ -0,0 +1,8 @@
+libmvec {
+  GLIBC_2.22 {
+    _ZGVbN2v_cos;
+    _ZGVcN4v_cos;
+    _ZGVdN4v_cos;
+    _ZGVeN8v_cos;
+  }
+}
diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile
b/sysdeps/x86_64/fpu/multiarch/Makefile
index 12b0526..5ccf97b 100644
--- a/sysdeps/x86_64/fpu/multiarch/Makefile
+++ b/sysdeps/x86_64/fpu/multiarch/Makefile
@@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
 CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
 endif
 endif
+
+ifeq ($(subdir),mathvec)
+libmvec-support += svml_d_cos2_core svml_d_cos8_core
+endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
new file mode 100644
index 0000000..dcf0925
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S
@@ -0,0 +1,35 @@
+/* Multiple versions of vectorized cos.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+
+ .text
+ENTRY (_ZGVbN2v_cos)
+        .type   _ZGVbN2v_cos, @gnu_indirect_function
+        cmpl    $0, KIND_OFFSET+__cpu_features(%rip)
+        jne     1f
+        call    __init_cpu_features
+1:      leaq    _ZGVbN2v_cos_sse4(%rip), %rax
+        testl   $bit_SSE4_1, __cpu_features+CPUID_OFFSET+index_SSE4_1(%rip)
+        jz      2f
+        ret
+2:      leaq    _ZGVbN2v_cos_sse2(%rip), %rax
+        ret
+END (_ZGVbN2v_cos)
+libmvec_hidden_def (_ZGVbN2v_cos)
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S
b/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S
new file mode 100644
index 0000000..27314f9
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S
@@ -0,0 +1,36 @@
+/* Multiple versions of vectorized cos.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+
+ .text
+ENTRY (_ZGVeN8v_cos)
+        .type   _ZGVeN8v_cos, @gnu_indirect_function
+        cmpl    $0, KIND_OFFSET+__cpu_features(%rip)
+        jne     1
+        call    __init_cpu_features
+1:      leaq    _ZGVeN8v_cos_skx(%rip), %rax
+        testl   $bit_AVX512DQ_Usable,
__cpu_features+FEATURE_OFFSET+index_AVX512DQ_Usable(%rip)
+        jnz     3
+2:      leaq    _ZGVeN8v_cos_knl(%rip), %rax
+        testl   $bit_AVX512F_Usable,
__cpu_features+FEATURE_OFFSET+index_AVX512F_Usable(%rip)
+        jnz     3
+        leaq    _ZGVeN8v_cos_avx2_wrapper(%rip), %rax
+3:      ret
+END (_ZGVeN8v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
new file mode 100644
index 0000000..658fe68
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos2_core_sse.S
@@ -0,0 +1,228 @@
+/* Function cos vectorized with SSE2 and SSE4.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVbN2v_cos_sse2)
+WRAPPER_IMPL_SSE2 cos
+END (_ZGVbN2v_cos_sse2)
+
+ENTRY (_ZGVbN2v_cos_sse4)
+/* ALGORITHM DESCRIPTION:
+
+        ( low accuracy ( < 4ulp ) or enhanced performance
+         ( half of correct mantissa ) implementation )
+
+        Argument representation:
+        arg + Pi/2 = (N*Pi + R)
+
+        Result calculation:
+        cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+        sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq      %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq      $-64, %rsp
+        subq      $320, %rsp
+        movaps    %xmm0, %xmm3
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+        movups    __dHalfPI(%rax), %xmm2
+
+/* ARGUMENT RANGE REDUCTION:
+   Add Pi/2 to argument: X' = X+Pi/2
+ */
+        addpd     %xmm3, %xmm2
+        movups    __dInvPI(%rax), %xmm5
+        movups    __dAbsMask(%rax), %xmm4
+
+/* Get absolute argument value: X' = |X'| */
+        andps     %xmm2, %xmm4
+
+/* Y = X'*InvPi + RS : right shifter add */
+        mulpd     %xmm5, %xmm2
+
+/* Check for large arguments path */
+        cmpnlepd  __dRangeVal(%rax), %xmm4
+        movups    __dRShifter(%rax), %xmm6
+        addpd     %xmm6, %xmm2
+        movmskpd  %xmm4, %ecx
+
+/* N = Y - RS : right shifter sub */
+        movaps    %xmm2, %xmm1
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        psllq     $63, %xmm2
+        subpd     %xmm6, %xmm1
+
+/* N = N - 0.5 */
+        subpd     __dOneHalf(%rax), %xmm1
+        movups    __dPI1(%rax), %xmm7
+
+/* R = X - N*Pi1 */
+        mulpd     %xmm1, %xmm7
+        movups    __dPI2(%rax), %xmm4
+
+/* R = R - N*Pi2 */
+        mulpd     %xmm1, %xmm4
+        subpd     %xmm7, %xmm0
+        movups    __dPI3(%rax), %xmm5
+
+/* R = R - N*Pi3 */
+        mulpd     %xmm1, %xmm5
+        subpd     %xmm4, %xmm0
+
+/* R = R - N*Pi4 */
+        movups     __dPI4(%rax), %xmm6
+        mulpd     %xmm6, %xmm1
+        subpd     %xmm5, %xmm0
+        subpd     %xmm1, %xmm0
+
+/* POLYNOMIAL APPROXIMATION: R2 = R*R */
+        movaps    %xmm0, %xmm4
+        mulpd     %xmm0, %xmm4
+        movups    __dC7(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC6(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC5(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC4(%rax), %xmm1
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        mulpd     %xmm4, %xmm1
+        addpd     __dC3(%rax), %xmm1
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        mulpd     %xmm4, %xmm1
+        addpd     __dC2(%rax), %xmm1
+        mulpd     %xmm4, %xmm1
+        addpd     __dC1(%rax), %xmm1
+        mulpd     %xmm1, %xmm4
+        mulpd     %xmm0, %xmm4
+        addpd     %xmm4, %xmm0
+
+/* RECONSTRUCTION:
+   Final sign setting: Res = Poly^SignRes */
+        xorps     %xmm2, %xmm0
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        cfi_remember_state
+        movq      %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq      %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+
+.LBL_1_3:
+        cfi_restore_state
+        movups    %xmm3, 192(%rsp)
+        movups    %xmm0, 256(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        movups    %xmm8, 112(%rsp)
+        movups    %xmm9, 96(%rsp)
+        movups    %xmm10, 80(%rsp)
+        movups    %xmm11, 64(%rsp)
+        movups    %xmm12, 48(%rsp)
+        movups    %xmm13, 32(%rsp)
+        movups    %xmm14, 16(%rsp)
+        movups    %xmm15, (%rsp)
+        movq      %rsi, 136(%rsp)
+        movq      %rdi, 128(%rsp)
+        movq      %r12, 168(%rsp)
+        cfi_offset_rel_rsp (12, 168)
+        movb      %dl, %r12b
+        movq      %r13, 160(%rsp)
+        cfi_offset_rel_rsp (13, 160)
+        movl      %ecx, %r13d
+        movq      %r14, 152(%rsp)
+        cfi_offset_rel_rsp (14, 152)
+        movl      %eax, %r14d
+        movq      %r15, 144(%rsp)
+        cfi_offset_rel_rsp (15, 144)
+        cfi_remember_state
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        movups    112(%rsp), %xmm8
+        movups    96(%rsp), %xmm9
+        movups    80(%rsp), %xmm10
+        movups    64(%rsp), %xmm11
+        movups    48(%rsp), %xmm12
+        movups    32(%rsp), %xmm13
+        movups    16(%rsp), %xmm14
+        movups    (%rsp), %xmm15
+        movq      136(%rsp), %rsi
+        movq      128(%rsp), %rdi
+        movq      168(%rsp), %r12
+        cfi_restore (%r12)
+        movq      160(%rsp), %r13
+        cfi_restore (%r13)
+        movq      152(%rsp), %r14
+        cfi_restore (%r14)
+        movq      144(%rsp), %r15
+        cfi_restore (%r15)
+        movups    256(%rsp), %xmm0
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        cfi_restore_state
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        movsd     200(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        movsd     %xmm0, 264(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        movsd     192(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        movsd     %xmm0, 256(%rsp,%r15)
+        jmp       .LBL_1_7
+
+END (_ZGVbN2v_cos_sse4)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
new file mode 100644
index 0000000..bf10b01
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S
@@ -0,0 +1,25 @@
+/* Function cos vectorized in AVX ISA as wrapper to SSE4 ISA version.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVcN4v_cos)
+WRAPPER_IMPL_AVX _ZGVbN2v_cos
+END (_ZGVcN4v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
new file mode 100644
index 0000000..ec8548f
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos4_core_avx2.S
@@ -0,0 +1,208 @@
+/* Function cos vectorized with AVX2.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+
+ .text
+ENTRY (_ZGVdN4v_cos)
+
+/* ALGORITHM DESCRIPTION:
+
+      ( low accuracy ( < 4ulp ) or enhanced performance
+       ( half of correct mantissa ) implementation )
+
+      Argument representation:
+      arg + Pi/2 = (N*Pi + R)
+
+      Result calculation:
+      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+      sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq      %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq      $-64, %rsp
+        subq      $448, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+        vmovapd   %ymm0, %ymm1
+        vmovupd __dInvPI(%rax), %ymm4
+        vmovupd __dRShifter(%rax), %ymm5
+
+/*
+   ARGUMENT RANGE REDUCTION:
+   Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %ymm1, %ymm7
+
+/* Get absolute argument value: X' = |X'| */
+        vandpd __dAbsMask(%rax), %ymm7, %ymm2
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd %ymm5, %ymm4, %ymm7
+        vmovupd __dC7(%rax), %ymm4
+
+/* Check for large arguments path */
+        vcmpnle_uqpd __dRangeVal(%rax), %ymm2, %ymm3
+
+/* N = Y - RS : right shifter sub */
+        vsubpd    %ymm5, %ymm7, %ymm6
+        vmovupd __dPI1_FMA(%rax), %ymm2
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %ymm7, %ymm7
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %ymm6, %ymm0
+        vmovmskpd %ymm3, %ecx
+
+/* R = X - N*Pi1 */
+        vmovapd   %ymm1, %ymm3
+        vfnmadd231pd %ymm0, %ymm2, %ymm3
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %ymm0, %ymm3
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %ymm3, %ymm0
+
+/* POLYNOMIAL APPROXIMATION: R2 = R*R */
+        vmulpd    %ymm0, %ymm0, %ymm5
+        vfmadd213pd __dC6(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC5(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC4(%rax), %ymm5, %ymm4
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %ymm5, %ymm4
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %ymm5, %ymm4
+        vfmadd213pd __dC1(%rax), %ymm5, %ymm4
+        vmulpd    %ymm5, %ymm4, %ymm6
+        vfmadd213pd %ymm0, %ymm0, %ymm6
+
+/*
+   RECONSTRUCTION:
+   Final sign setting: Res = Poly^SignRes */
+        vxorpd    %ymm7, %ymm6, %ymm0
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        cfi_remember_state
+        movq      %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq      %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+
+.LBL_1_3:
+        cfi_restore_state
+        vmovupd   %ymm1, 320(%rsp)
+        vmovupd   %ymm0, 384(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        vmovups   %ymm8, 224(%rsp)
+        vmovups   %ymm9, 192(%rsp)
+        vmovups   %ymm10, 160(%rsp)
+        vmovups   %ymm11, 128(%rsp)
+        vmovups   %ymm12, 96(%rsp)
+        vmovups   %ymm13, 64(%rsp)
+        vmovups   %ymm14, 32(%rsp)
+        vmovups   %ymm15, (%rsp)
+        movq      %rsi, 264(%rsp)
+        movq      %rdi, 256(%rsp)
+        movq      %r12, 296(%rsp)
+        cfi_offset_rel_rsp (12, 296)
+        movb      %dl, %r12b
+        movq      %r13, 288(%rsp)
+        cfi_offset_rel_rsp (13, 288)
+        movl      %ecx, %r13d
+        movq      %r14, 280(%rsp)
+        cfi_offset_rel_rsp (14, 280)
+        movl      %eax, %r14d
+        movq      %r15, 272(%rsp)
+        cfi_offset_rel_rsp (15, 272)
+        cfi_remember_state
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        vmovups   224(%rsp), %ymm8
+        vmovups   192(%rsp), %ymm9
+        vmovups   160(%rsp), %ymm10
+        vmovups   128(%rsp), %ymm11
+        vmovups   96(%rsp), %ymm12
+        vmovups   64(%rsp), %ymm13
+        vmovups   32(%rsp), %ymm14
+        vmovups   (%rsp), %ymm15
+        vmovupd   384(%rsp), %ymm0
+        movq      264(%rsp), %rsi
+        movq      256(%rsp), %rdi
+        movq      296(%rsp), %r12
+        cfi_restore (%r12)
+        movq      288(%rsp), %r13
+        cfi_restore (%r13)
+        movq      280(%rsp), %r14
+        cfi_restore (%r14)
+        movq      272(%rsp), %r15
+        cfi_restore (%r15)
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        cfi_restore_state
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    328(%rsp,%r15), %xmm0
+        vzeroupper
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 392(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    320(%rsp,%r15), %xmm0
+        vzeroupper
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 384(%rsp,%r15)
+        jmp       .LBL_1_7
+
+END (_ZGVdN4v_cos)
+libmvec_hidden_def (_ZGVdN4v_cos)
diff --git a/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
new file mode 100644
index 0000000..daa77a9
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos8_core_avx512.S
@@ -0,0 +1,467 @@
+/* Function cos vectorized with AVX-512, wrapper to AVX2, KNL and SKX versions.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_cos_data.h"
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVeN8v_cos_avx2_wrapper)
+WRAPPER_IMPL_AVX512 _ZGVdN4v_cos
+END (_ZGVeN8v_cos_avx2_wrapper)
+
+ENTRY (_ZGVeN8v_cos_knl)
+#ifndef HAVE_AVX512_ASM_SUPPORT
+WRAPPER_IMPL_AVX512 _ZGVdN4v_cos
+#else
+/*
+  ALGORITHM DESCRIPTION:
+
+       ( low accuracy ( < 4ulp ) or enhanced performance
+        ( half of correct mantissa ) implementation )
+
+        Argument representation:
+        arg + Pi/2 = (N*Pi + R)
+
+        Result calculation:
+        cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+        sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq      %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq      $-64, %rsp
+        subq      $1280, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+
+/* R = X - N*Pi1 */
+        vmovaps   %zmm0, %zmm7
+
+/* Check for large arguments path */
+        movq      $-1, %rcx
+
+/*
+  ARGUMENT RANGE REDUCTION:
+  Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %zmm0, %zmm5
+        vmovups __dInvPI(%rax), %zmm3
+
+/* Get absolute argument value: X' = |X'| */
+        vpandq __dAbsMask(%rax), %zmm5, %zmm1
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd __dRShifter(%rax), %zmm3, %zmm5
+        vmovups __dPI1_FMA(%rax), %zmm6
+
+/* N = Y - RS : right shifter sub */
+        vsubpd __dRShifter(%rax), %zmm5, %zmm4
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %zmm5, %zmm12
+        vmovups __dC7(%rax), %zmm8
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %zmm4, %zmm10
+        vcmppd    $22, __dRangeVal(%rax), %zmm1, %k1
+        vpbroadcastq %rcx, %zmm2{%k1}{z}
+        vfnmadd231pd %zmm10, %zmm6, %zmm7
+        vptestmq  %zmm2, %zmm2, %k0
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %zmm10, %zmm7
+        kmovw     %k0, %ecx
+        movzbl    %cl, %ecx
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %zmm7, %zmm10
+
+/*
+  POLYNOMIAL APPROXIMATION:
+  R2 = R*R
+ */
+        vmulpd    %zmm10, %zmm10, %zmm9
+        vfmadd213pd __dC6(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC5(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC4(%rax), %zmm9, %zmm8
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %zmm9, %zmm8
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %zmm9, %zmm8
+        vfmadd213pd __dC1(%rax), %zmm9, %zmm8
+        vmulpd    %zmm9, %zmm8, %zmm11
+        vfmadd213pd %zmm10, %zmm10, %zmm11
+
+/*
+  RECONSTRUCTION:
+  Final sign setting: Res = Poly^SignRes
+ */
+        vpxorq    %zmm12, %zmm11, %zmm1
+        testl     %ecx, %ecx
+        jne       .LBL_1_3
+
+.LBL_1_2:
+        cfi_remember_state
+        vmovaps   %zmm1, %zmm0
+        movq      %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq      %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+
+.LBL_1_3:
+        cfi_restore_state
+        vmovups   %zmm0, 1152(%rsp)
+        vmovups   %zmm1, 1216(%rsp)
+        je        .LBL_1_2
+
+        xorb      %dl, %dl
+        kmovw     %k4, 1048(%rsp)
+        xorl      %eax, %eax
+        kmovw     %k5, 1040(%rsp)
+        kmovw     %k6, 1032(%rsp)
+        kmovw     %k7, 1024(%rsp)
+        vmovups   %zmm16, 960(%rsp)
+        vmovups   %zmm17, 896(%rsp)
+        vmovups   %zmm18, 832(%rsp)
+        vmovups   %zmm19, 768(%rsp)
+        vmovups   %zmm20, 704(%rsp)
+        vmovups   %zmm21, 640(%rsp)
+        vmovups   %zmm22, 576(%rsp)
+        vmovups   %zmm23, 512(%rsp)
+        vmovups   %zmm24, 448(%rsp)
+        vmovups   %zmm25, 384(%rsp)
+        vmovups   %zmm26, 320(%rsp)
+        vmovups   %zmm27, 256(%rsp)
+        vmovups   %zmm28, 192(%rsp)
+        vmovups   %zmm29, 128(%rsp)
+        vmovups   %zmm30, 64(%rsp)
+        vmovups   %zmm31, (%rsp)
+        movq      %rsi, 1064(%rsp)
+        movq      %rdi, 1056(%rsp)
+        movq      %r12, 1096(%rsp)
+        cfi_offset_rel_rsp (12, 1096)
+        movb      %dl, %r12b
+        movq      %r13, 1088(%rsp)
+        cfi_offset_rel_rsp (13, 1088)
+        movl      %ecx, %r13d
+        movq      %r14, 1080(%rsp)
+        cfi_offset_rel_rsp (12, 1080)
+        movl      %eax, %r14d
+        movq      %r15, 1072(%rsp)
+        cfi_offset_rel_rsp (12, 1072)
+        cfi_remember_state
+
+.LBL_1_6:
+        btl       %r14d, %r13d
+        jc        .LBL_1_12
+
+.LBL_1_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_1_10
+
+.LBL_1_8:
+        addb      $1, %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_1_6
+
+        kmovw     1048(%rsp), %k4
+        movq      1064(%rsp), %rsi
+        kmovw     1040(%rsp), %k5
+        movq      1056(%rsp), %rdi
+        kmovw     1032(%rsp), %k6
+        movq      1096(%rsp), %r12
+        cfi_restore (%r12)
+        movq      1088(%rsp), %r13
+        cfi_restore (%r13)
+        kmovw     1024(%rsp), %k7
+        vmovups   960(%rsp), %zmm16
+        vmovups   896(%rsp), %zmm17
+        vmovups   832(%rsp), %zmm18
+        vmovups   768(%rsp), %zmm19
+        vmovups   704(%rsp), %zmm20
+        vmovups   640(%rsp), %zmm21
+        vmovups   576(%rsp), %zmm22
+        vmovups   512(%rsp), %zmm23
+        vmovups   448(%rsp), %zmm24
+        vmovups   384(%rsp), %zmm25
+        vmovups   320(%rsp), %zmm26
+        vmovups   256(%rsp), %zmm27
+        vmovups   192(%rsp), %zmm28
+        vmovups   128(%rsp), %zmm29
+        vmovups   64(%rsp), %zmm30
+        vmovups   (%rsp), %zmm31
+        movq      1080(%rsp), %r14
+        cfi_restore (%r14)
+        movq      1072(%rsp), %r15
+        cfi_restore (%r15)
+        vmovups   1216(%rsp), %zmm1
+        jmp       .LBL_1_2
+
+.LBL_1_10:
+        cfi_restore_state
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1160(%rsp,%r15), %xmm0
+        call      cos@PLT
+        vmovsd    %xmm0, 1224(%rsp,%r15)
+        jmp       .LBL_1_8
+
+.LBL_1_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1152(%rsp,%r15), %xmm0
+        call      cos@PLT
+        vmovsd    %xmm0, 1216(%rsp,%r15)
+        jmp       .LBL_1_7
+#endif
+END (_ZGVeN8v_cos_knl)
+
+ENTRY (_ZGVeN8v_cos_skx)
+#ifndef HAVE_AVX512_ASM_SUPPORT
+WRAPPER_IMPL_AVX512 _ZGVdN4v_cos
+#else
+/*
+   ALGORITHM DESCRIPTION:
+
+      ( low accuracy ( < 4ulp ) or enhanced performance
+       ( half of correct mantissa ) implementation )
+
+      Argument representation:
+      arg + Pi/2 = (N*Pi + R)
+
+      Result calculation:
+      cos(arg) = sin(arg+Pi/2) = sin(N*Pi + R) = (-1)^N * sin(R)
+      sin(R) is approximated by corresponding polynomial
+ */
+        pushq     %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq      %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq      $-64, %rsp
+        subq      $1280, %rsp
+        movq      __svml_dcos_data@GOTPCREL(%rip), %rax
+
+/* R = X - N*Pi1 */
+        vmovaps   %zmm0, %zmm8
+
+/* Check for large arguments path */
+        vpbroadcastq .L_2il0floatpacket.16(%rip), %zmm2
+
+/*
+  ARGUMENT RANGE REDUCTION:
+  Add Pi/2 to argument: X' = X+Pi/2
+ */
+        vaddpd __dHalfPI(%rax), %zmm0, %zmm6
+        vmovups __dInvPI(%rax), %zmm3
+        vmovups __dRShifter(%rax), %zmm4
+        vmovups __dPI1_FMA(%rax), %zmm7
+        vmovups __dC7(%rax), %zmm9
+
+/* Get absolute argument value: X' = |X'| */
+        vandpd __dAbsMask(%rax), %zmm6, %zmm1
+
+/* Y = X'*InvPi + RS : right shifter add */
+        vfmadd213pd %zmm4, %zmm3, %zmm6
+        vcmppd    $18, __dRangeVal(%rax), %zmm1, %k1
+
+/* SignRes = Y<<63 : shift LSB to MSB place for result sign */
+        vpsllq    $63, %zmm6, %zmm13
+
+/* N = Y - RS : right shifter sub */
+        vsubpd    %zmm4, %zmm6, %zmm5
+
+/* N = N - 0.5 */
+        vsubpd __dOneHalf(%rax), %zmm5, %zmm11
+        vfnmadd231pd %zmm11, %zmm7, %zmm8
+
+/* R = R - N*Pi2 */
+        vfnmadd231pd __dPI2_FMA(%rax), %zmm11, %zmm8
+
+/* R = R - N*Pi3 */
+        vfnmadd132pd __dPI3_FMA(%rax), %zmm8, %zmm11
+
+/*
+  POLYNOMIAL APPROXIMATION:
+  R2 = R*R
+ */
+        vmulpd    %zmm11, %zmm11, %zmm10
+        vfmadd213pd __dC6(%rax), %zmm10, %zmm9
+        vfmadd213pd __dC5(%rax), %zmm10, %zmm9
+        vfmadd213pd __dC4(%rax), %zmm10, %zmm9
+
+/* Poly = C3+R2*(C4+R2*(C5+R2*(C6+R2*C7))) */
+        vfmadd213pd __dC3(%rax), %zmm10, %zmm9
+
+/* Poly = R+R*(R2*(C1+R2*(C2+R2*Poly))) */
+        vfmadd213pd __dC2(%rax), %zmm10, %zmm9
+        vfmadd213pd __dC1(%rax), %zmm10, %zmm9
+        vmulpd    %zmm10, %zmm9, %zmm12
+        vfmadd213pd %zmm11, %zmm11, %zmm12
+        vpandnq   %zmm1, %zmm1, %zmm2{%k1}
+        vcmppd    $3, %zmm2, %zmm2, %k0
+
+/*
+  RECONSTRUCTION:
+  Final sign setting: Res = Poly^SignRes
+ */
+        vxorpd    %zmm13, %zmm12, %zmm1
+        kmovw     %k0, %ecx
+        testl     %ecx, %ecx
+        jne       .LBL_2_3
+
+.LBL_2_2:
+        cfi_remember_state
+        vmovaps   %zmm1, %zmm0
+        movq      %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq      %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+
+.LBL_2_3:
+        cfi_restore_state
+        vmovups   %zmm0, 1152(%rsp)
+        vmovups   %zmm1, 1216(%rsp)
+        je        .LBL_2_2
+
+        xorb      %dl, %dl
+        xorl      %eax, %eax
+        kmovw     %k4, 1048(%rsp)
+        kmovw     %k5, 1040(%rsp)
+        kmovw     %k6, 1032(%rsp)
+        kmovw     %k7, 1024(%rsp)
+        vmovups   %zmm16, 960(%rsp)
+        vmovups   %zmm17, 896(%rsp)
+        vmovups   %zmm18, 832(%rsp)
+        vmovups   %zmm19, 768(%rsp)
+        vmovups   %zmm20, 704(%rsp)
+        vmovups   %zmm21, 640(%rsp)
+        vmovups   %zmm22, 576(%rsp)
+        vmovups   %zmm23, 512(%rsp)
+        vmovups   %zmm24, 448(%rsp)
+        vmovups   %zmm25, 384(%rsp)
+        vmovups   %zmm26, 320(%rsp)
+        vmovups   %zmm27, 256(%rsp)
+        vmovups   %zmm28, 192(%rsp)
+        vmovups   %zmm29, 128(%rsp)
+        vmovups   %zmm30, 64(%rsp)
+        vmovups   %zmm31, (%rsp)
+        movq      %rsi, 1064(%rsp)
+        movq      %rdi, 1056(%rsp)
+        movq      %r12, 1096(%rsp)
+        cfi_offset_rel_rsp (12, 1096)
+        movb      %dl, %r12b
+        movq      %r13, 1088(%rsp)
+        cfi_offset_rel_rsp (13, 1088)
+        movl      %ecx, %r13d
+        movq      %r14, 1080(%rsp)
+        cfi_offset_rel_rsp (14, 1080)
+        movl      %eax, %r14d
+        movq      %r15, 1072(%rsp)
+        cfi_offset_rel_rsp (15, 1072)
+        cfi_remember_state
+
+.LBL_2_6:
+        btl       %r14d, %r13d
+        jc        .LBL_2_12
+
+.LBL_2_7:
+        lea       1(%r14), %esi
+        btl       %esi, %r13d
+        jc        .LBL_2_10
+
+.LBL_2_8:
+        incb      %r12b
+        addl      $2, %r14d
+        cmpb      $16, %r12b
+        jb        .LBL_2_6
+
+        kmovw     1048(%rsp), %k4
+        kmovw     1040(%rsp), %k5
+        kmovw     1032(%rsp), %k6
+        kmovw     1024(%rsp), %k7
+        vmovups   960(%rsp), %zmm16
+        vmovups   896(%rsp), %zmm17
+        vmovups   832(%rsp), %zmm18
+        vmovups   768(%rsp), %zmm19
+        vmovups   704(%rsp), %zmm20
+        vmovups   640(%rsp), %zmm21
+        vmovups   576(%rsp), %zmm22
+        vmovups   512(%rsp), %zmm23
+        vmovups   448(%rsp), %zmm24
+        vmovups   384(%rsp), %zmm25
+        vmovups   320(%rsp), %zmm26
+        vmovups   256(%rsp), %zmm27
+        vmovups   192(%rsp), %zmm28
+        vmovups   128(%rsp), %zmm29
+        vmovups   64(%rsp), %zmm30
+        vmovups   (%rsp), %zmm31
+        vmovups   1216(%rsp), %zmm1
+        movq      1064(%rsp), %rsi
+        movq      1056(%rsp), %rdi
+        movq      1096(%rsp), %r12
+        cfi_restore (%r12)
+        movq      1088(%rsp), %r13
+        cfi_restore (%r13)
+        movq      1080(%rsp), %r14
+        cfi_restore (%r14)
+        movq      1072(%rsp), %r15
+        cfi_restore (%r15)
+        jmp       .LBL_2_2
+
+.LBL_2_10:
+        cfi_restore_state
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1160(%rsp,%r15), %xmm0
+        vzeroupper
+        vmovsd    1160(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 1224(%rsp,%r15)
+        jmp       .LBL_2_8
+
+.LBL_2_12:
+        movzbl    %r12b, %r15d
+        shlq      $4, %r15
+        vmovsd    1152(%rsp,%r15), %xmm0
+        vzeroupper
+        vmovsd    1152(%rsp,%r15), %xmm0
+
+        call      cos@PLT
+
+        vmovsd    %xmm0, 1216(%rsp,%r15)
+        jmp       .LBL_2_7
+#endif
+END (_ZGVeN8v_cos_skx)
+
+ .section .rodata, "a"
+.L_2il0floatpacket.16:
+ .long 0xffffffff,0xffffffff
+ .type .L_2il0floatpacket.16,@object
diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.S
b/sysdeps/x86_64/fpu/svml_d_cos_data.S
new file mode 100644
index 0000000..c9bfd63
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos_data.S
@@ -0,0 +1,114 @@
+/* Data for vectorized cos.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "svml_d_cos_data.h"
+
+.macro double_vector offset value
+.if .-__svml_dcos_data != \offset
+.err
+.endif
+.rept 8
+.quad \value
+.endr
+.endm
+
+ .section .rodata, "a"
+ .align 64
+
+/* Data table for vector implementations of function cos.
+   The table may contain polynomial, reduction, lookup
+   coefficients and other constants obtained through different
+   methods of research and experimental work.
+ */
+ .globl __svml_dcos_data
+__svml_dcos_data:
+
+/* General purpose constants:
+   absolute value mask
+ */
+double_vector __dAbsMask 0x7fffffffffffffff
+
+/* working range threshold */
+double_vector __dRangeVal 0x4160000000000000
+
+/* PI/2 */
+double_vector __dHalfPI 0x3ff921fb54442d18
+
+/* 1/PI */
+double_vector __dInvPI 0x3fd45f306dc9c883
+
+/* right-shifter constant */
+double_vector __dRShifter 0x4338000000000000
+
+/* 0.5 */
+double_vector __dOneHalf 0x3fe0000000000000
+
+/* Range reduction PI-based constants:
+   PI high part
+ */
+double_vector __dPI1 0x400921fb40000000
+
+/* PI mid  part 1 */
+double_vector __dPI2 0x3e84442d00000000
+
+/* PI mid  part 2 */
+double_vector __dPI3 0x3d08469880000000
+
+/* PI low  part */
+double_vector __dPI4 0x3b88cc51701b839a
+
+/* Range reduction PI-based constants if FMA available:
+   PI high part (FMA available)
+ */
+double_vector __dPI1_FMA 0x400921fb54442d18
+
+/* PI mid part  (FMA available) */
+double_vector __dPI2_FMA 0x3ca1a62633145c06
+
+/* PI low part  (FMA available) */
+double_vector __dPI3_FMA 0x395c1cd129024e09
+
+/* Polynomial coefficients (relative error 2^(-52.115)): */
+double_vector __dC1 0xbfc55555555554a7
+double_vector __dC2 0x3f8111111110a4a8
+double_vector __dC3 0xbf2a01a019a5b86d
+double_vector __dC4 0x3ec71de38030fea0
+double_vector __dC5 0xbe5ae63546002231
+double_vector __dC6 0x3de60e6857a2f220
+double_vector __dC7 0xbd69f0d60811aac8
+
+/*
+   Additional constants:
+   absolute value mask
+ */
+double_vector __dAbsMask_la 0x7fffffffffffffff
+
+/* 1/PI */
+double_vector __dInvPI_la 0x3fd45f306dc9c883
+
+/* right-shifer for low accuracy version */
+double_vector __dRShifter_la 0x4330000000000000
+
+/* right-shifer-1.0 for low accuracy version */
+double_vector __dRShifterm5_la 0x432fffffffffffff
+
+/* right-shifer with low mask for low accuracy version */
+double_vector __dRXmax_la 0x43300000007ffffe
+
+ .type __svml_dcos_data,@object
+ .size __svml_dcos_data,.-__svml_dcos_data
diff --git a/sysdeps/x86_64/fpu/svml_d_cos_data.h
b/sysdeps/x86_64/fpu/svml_d_cos_data.h
new file mode 100644
index 0000000..4d28e6e
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_cos_data.h
@@ -0,0 +1,48 @@
+/* Offsets for data table for vectorized cos.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef D_COS_DATA_H
+#define D_COS_DATA_H
+
+#define __dAbsMask              0
+#define __dRangeVal             64
+#define __dHalfPI               128
+#define __dInvPI                192
+#define __dRShifter             256
+#define __dOneHalf              320
+#define __dPI1                  384
+#define __dPI2                  448
+#define __dPI3                  512
+#define __dPI4                  576
+#define __dPI1_FMA              640
+#define __dPI2_FMA              704
+#define __dPI3_FMA              768
+#define __dC1                   832
+#define __dC2                   896
+#define __dC3                   960
+#define __dC4                   1024
+#define __dC5                   1088
+#define __dC6                   1152
+#define __dC7                   1216
+#define __dAbsMask_la           1280
+#define __dInvPI_la             1344
+#define __dRShifter_la          1408
+#define __dRShifterm5_la        1472
+#define __dRXmax_la             1536
+
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_wrapper_impl.h
b/sysdeps/x86_64/fpu/svml_d_wrapper_impl.h
new file mode 100644
index 0000000..4da5e61
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_wrapper_impl.h
@@ -0,0 +1,101 @@
+/* Wrapper implementations of several versions of vector math functions.
+   Copyright (C) 2014-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* SSE2 ISA version as wrapper to scalar.  */
+.macro WRAPPER_IMPL_SSE2 callee
+        subq      $40, %rsp
+        cfi_adjust_cfa_offset(40)
+        movaps    %xmm0, (%rsp)
+        call      \callee@PLT
+        movsd     %xmm0, 16(%rsp)
+        movsd     8(%rsp), %xmm0
+        call      \callee@PLT
+        movsd     16(%rsp), %xmm1
+        movsd     %xmm0, 24(%rsp)
+        unpcklpd  %xmm0, %xmm1
+        movaps    %xmm1, %xmm0
+        addq      $40, %rsp
+        cfi_adjust_cfa_offset(-40)
+        ret
+.endm
+
+/* AVX ISA version as wrapper to SSE4 ISA version.  */
+.macro WRAPPER_IMPL_AVX callee
+        pushq %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq $-32, %rsp
+        subq $32, %rsp
+        vextractf128 $1, %ymm0, (%rsp)
+        vzeroupper
+        call HIDDEN_JUMPTARGET(\callee)
+        vmovapd %xmm0, 16(%rsp)
+        vmovaps (%rsp), %xmm0
+        call HIDDEN_JUMPTARGET(\callee)
+        vmovapd %xmm0, %xmm1
+        vmovapd 16(%rsp), %xmm0
+        vinsertf128 $1, %xmm1, %ymm0, %ymm0
+        movq %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+.endm
+
+/* AVX512 ISA version as wrapper to AVX2 ISA version.  */
+.macro WRAPPER_IMPL_AVX512 callee
+        pushq %rbp
+        cfi_adjust_cfa_offset (8)
+        cfi_rel_offset (%rbp, 0)
+        movq %rsp, %rbp
+        cfi_def_cfa_register (%rbp)
+        andq $-64, %rsp
+        subq $64, %rsp
+/* Below is encoding for vmovaps %zmm0, (%rsp).  */
+        .byte 0x62
+        .byte 0xf1
+        .byte 0x7c
+        .byte 0x48
+        .byte 0x29
+        .byte 0x04
+        .byte 0x24
+/* Below is encoding for vmovapd (%rsp), %ymm0.  */
+        .byte 0xc5
+        .byte 0xfd
+        .byte 0x28
+        .byte 0x04
+        .byte 0x24
+        call HIDDEN_JUMPTARGET(\callee)
+/* Below is encoding for vmovapd 32(%rsp), %ymm0.  */
+        .byte 0xc5
+        .byte 0xfd
+        .byte 0x28
+        .byte 0x44
+        .byte 0x24
+        .byte 0x20
+        call HIDDEN_JUMPTARGET(\callee)
+        movq %rbp, %rsp
+        cfi_def_cfa_register (%rsp)
+        popq %rbp
+        cfi_adjust_cfa_offset (-8)
+        cfi_restore (%rbp)
+        ret
+.endm
diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h
index e652171..e79a397 100644
--- a/sysdeps/x86_64/sysdep.h
+++ b/sysdeps/x86_64/sysdep.h
@@ -25,6 +25,13 @@

 /* Syntactic details of assembler.  */

+/* This macro is for setting proper CFI with DW_CFA_expression describing
+   the register as saved relative to %rsp instead of relative to the CFA.
+   Expression is DW_OP_drop, DW_OP_breg7 (%rsp is register 7), sleb128 offset
+   from %rsp.  */
+#define cfi_offset_rel_rsp(regn, off) .cfi_escape 0x10, regn, 0x4, 0x13, \
+ 0x77, off & 0x7F | 0x80, off >> 7
+
 /* ELF uses byte-counts for .align, most others use log2 of count of bytes.  */
 #define ALIGNARG(log2) 1<<log2
 #define ASM_SIZE_DIRECTIVE(name) .size name,.-name;

Is it ok?


--
WBR,
Andrew

0001-x86-64-vector-cos.patch (63K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
Was found typos, attached with fix.


--
WBR,
Andrew

0001-x86-64-vector-cos.patch (63K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Joseph Myers
On Thu, 21 May 2015, Andrew Senkevich wrote:

> diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
> index 12b0526..5ccf97b 100644
> --- a/sysdeps/x86_64/fpu/multiarch/Makefile
> +++ b/sysdeps/x86_64/fpu/multiarch/Makefile
> @@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
>  CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
>  endif
>  endif
> +
> +ifeq ($(subdir),mathvec)
> +libmvec-support += svml_d_cos2_core svml_d_cos8_core
> +endif

As far as I can tell, if you use --disable-multi-arch, then these files
won't be built into libmvec, and nothing else will provide the
_ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
examples of how such things are handled - typically, the file outside the
multiarch directory implements things for an always-supported architecture
variant (in this case, that would be a variant guaranteed to be supported
if the given entry point gets called), then, in the multiarch directory,
there are the implementations for other variants, and a file with the same
name as that outside the multiarch directory, that (a) provides the IFUNC
resolver and (b) defines some macros before #including the file in the
directory above, so that the basic version of the function gets defined
under a different name.

The elf/Makefile and include/libc-symbols.h changes are OK on their own -
I think it's best for them to go in now rather than together with the
first function implementations.

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
2015-05-22 18:31 GMT+03:00 Joseph Myers <[hidden email]>:
> The elf/Makefile and include/libc-symbols.h changes are OK on their own -
> I think it's best for them to go in now rather than together with the
> first function implementations.

Is it needed to post extracted part as separate patch?


--
WBR,
Andrew
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Joseph Myers
On Fri, 22 May 2015, Andrew Senkevich wrote:

> 2015-05-22 18:31 GMT+03:00 Joseph Myers <[hidden email]>:
> > The elf/Makefile and include/libc-symbols.h changes are OK on their own -
> > I think it's best for them to go in now rather than together with the
> > first function implementations.
>
> Is it needed to post extracted part as separate patch?

I don't think so.

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
In reply to this post by Joseph Myers
2015-05-22 18:31 GMT+03:00 Joseph Myers <[hidden email]>:

> On Thu, 21 May 2015, Andrew Senkevich wrote:
>
>> diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
>> index 12b0526..5ccf97b 100644
>> --- a/sysdeps/x86_64/fpu/multiarch/Makefile
>> +++ b/sysdeps/x86_64/fpu/multiarch/Makefile
>> @@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
>>  CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
>>  endif
>>  endif
>> +
>> +ifeq ($(subdir),mathvec)
>> +libmvec-support += svml_d_cos2_core svml_d_cos8_core
>> +endif
>
> As far as I can tell, if you use --disable-multi-arch, then these files
> won't be built into libmvec, and nothing else will provide the
> _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
> examples of how such things are handled - typically, the file outside the
> multiarch directory implements things for an always-supported architecture
> variant (in this case, that would be a variant guaranteed to be supported
> if the given entry point gets called), then, in the multiarch directory,
> there are the implementations for other variants, and a file with the same
> name as that outside the multiarch directory, that (a) provides the IFUNC
> resolver and (b) defines some macros before #including the file in the
> directory above, so that the basic version of the function gets defined
> under a different name.
Hi, Joseph, updated patch is attached. Is it ok?

2015-05-25  Andrew Senkevich  <[hidden email]>

        * sysdeps/x86_64/fpu/Makefile: New file.
        * sysdeps/x86_64/fpu/Versions: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
        * sysdeps/x86_64/fpu/svml_d_cos2_core.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
        * sysdeps/x86_64/fpu/svml_d_cos8_core.S: New file.
        * sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core_sse4.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core_avx2.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S: New file.
        * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: New file.
        * sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-sysdep_routines): Added
        build of SSE, AVX2 and AVX512 versions which are IFUNC.
        * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
        * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
        * sysdeps/x86_64/configure.ac: Options for libmvec build.
        * sysdeps/x86_64/configure: Regenerated.
        * sysdeps/x86_64/sysdep.h (cfi_offset_rel_rsp): New macro.
        * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.


--
WBR,
Andrew

0001-x86-64-vector-cos.patch (68K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
One inconsistency was found in this patch (in _ZGVdN4v_cos
implementation), here is diff for fix:

diff --git a/0001-x86-64-vector-cos.patch b/0001-x86-64-vector-cos.patch_fixed
index b6d29fe..62fbc2b 100755
--- a/0001-x86-64-vector-cos.patch
+++ b/0001-x86-64-vector-cos.patch_fixed
@@ -435,7 +435,7 @@ index 0000000..b1725f7
 +        jne     1f
 +        call    __init_cpu_features
 +1:      leaq    _ZGVdN4v_cos_avx2(%rip), %rax
-+        testl   $bit_AVX2, __cpu_features+CPUID_OFFSET+index_AVX2(%rip)
++        testl   $bit_AVX2_Usable,
__cpu_features+FEATURE_OFFSET+index_AVX2_Usable(%rip)
 +        jz      2f
 +        ret
 +2:      leaq    _ZGVdN4v_cos_sse_wrapper(%rip), %rax


--
WBR,
Andrew
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
In reply to this post by Andrew Senkevich
2015-05-25 21:26 GMT+03:00 Andrew Senkevich <[hidden email]>:

> 2015-05-22 18:31 GMT+03:00 Joseph Myers <[hidden email]>:
>> On Thu, 21 May 2015, Andrew Senkevich wrote:
>>
>>> diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
>>> index 12b0526..5ccf97b 100644
>>> --- a/sysdeps/x86_64/fpu/multiarch/Makefile
>>> +++ b/sysdeps/x86_64/fpu/multiarch/Makefile
>>> @@ -51,3 +51,7 @@ CFLAGS-slowexp-avx.c = -msse2avx -DSSE2AVX
>>>  CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX
>>>  endif
>>>  endif
>>> +
>>> +ifeq ($(subdir),mathvec)
>>> +libmvec-support += svml_d_cos2_core svml_d_cos8_core
>>> +endif
>>
>> As far as I can tell, if you use --disable-multi-arch, then these files
>> won't be built into libmvec, and nothing else will provide the
>> _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
>> examples of how such things are handled - typically, the file outside the
>> multiarch directory implements things for an always-supported architecture
>> variant (in this case, that would be a variant guaranteed to be supported
>> if the given entry point gets called), then, in the multiarch directory,
>> there are the implementations for other variants, and a file with the same
>> name as that outside the multiarch directory, that (a) provides the IFUNC
>> resolver and (b) defines some macros before #including the file in the
>> directory above, so that the basic version of the function gets defined
>> under a different name.
>
> Hi, Joseph, updated patch is attached. Is it ok?
>
> 2015-05-25  Andrew Senkevich  <[hidden email]>
>
>         * sysdeps/x86_64/fpu/Makefile: New file.
>         * sysdeps/x86_64/fpu/Versions: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos2_core.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos4_core.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_cos8_core.S: New file.
>         * sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core_sse4.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core_avx2.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: New file.
>         * sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-sysdep_routines): Added
>         build of SSE, AVX2 and AVX512 versions which are IFUNC.
>         * sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
>         * math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
>         * sysdeps/x86_64/configure.ac: Options for libmvec build.
>         * sysdeps/x86_64/configure: Regenerated.
>         * sysdeps/x86_64/sysdep.h (cfi_offset_rel_rsp): New macro.
>         * sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.

Ping.


--
WBR,
Andrew
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Joseph Myers
In reply to this post by Andrew Senkevich
On Mon, 25 May 2015, Andrew Senkevich wrote:

> > As far as I can tell, if you use --disable-multi-arch, then these files
> > won't be built into libmvec, and nothing else will provide the
> > _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
> > examples of how such things are handled - typically, the file outside the
> > multiarch directory implements things for an always-supported architecture
> > variant (in this case, that would be a variant guaranteed to be supported
> > if the given entry point gets called), then, in the multiarch directory,
> > there are the implementations for other variants, and a file with the same
> > name as that outside the multiarch directory, that (a) provides the IFUNC
> > resolver and (b) defines some macros before #including the file in the
> > directory above, so that the basic version of the function gets defined
> > under a different name.
>
> Hi, Joseph, updated patch is attached. Is it ok?

OK provided you've tested this both with and without --disable-multi-arch.

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1/14] [x86_64] Vector math functions (vector cos)

Andrew Senkevich
2015-06-04 19:50 GMT+03:00 Joseph Myers <[hidden email]>:

> On Mon, 25 May 2015, Andrew Senkevich wrote:
>
>> > As far as I can tell, if you use --disable-multi-arch, then these files
>> > won't be built into libmvec, and nothing else will provide the
>> > _ZGVbN2v_cos and _ZGVeN8v_cos symbols.  See other multiarch code for
>> > examples of how such things are handled - typically, the file outside the
>> > multiarch directory implements things for an always-supported architecture
>> > variant (in this case, that would be a variant guaranteed to be supported
>> > if the given entry point gets called), then, in the multiarch directory,
>> > there are the implementations for other variants, and a file with the same
>> > name as that outside the multiarch directory, that (a) provides the IFUNC
>> > resolver and (b) defines some macros before #including the file in the
>> > directory above, so that the basic version of the function gets defined
>> > under a different name.
>>
>> Hi, Joseph, updated patch is attached. Is it ok?
>
> OK provided you've tested this both with and without --disable-multi-arch.

Thank you for review, of course both cases were tested.


--
WBR,
Andrew
12