[Bug testsuite/23493] New: Test suite makes all CPU stuck forever on kernel 2.6.16 (Fedora 27)

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] New: Test suite makes all CPU stuck forever on kernel 2.6.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

            Bug ID: 23493
           Summary: Test suite makes all CPU stuck forever on kernel
                    2.6.16 (Fedora 27)
           Product: systemtap
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: testsuite
          Assignee: systemtap at sourceware dot org
          Reporter: agentzh at gmail dot com
  Target Milestone: ---

I tried running the tests for systemtap release 3.3 (both the tests and the
source code were from that release) on my Fedora 27 with kernel 2.6.16, but
after running the tests for one hour, all the 8 logical CPU cores stuck for
more than 8 hours without going any further and I had to reboot the VM with
force (even ssh cannot work).

Below is all the error messages after the stuck happens:

Aug  7 00:27:28 work2 kernel: Systemtap Error at _stp_mempool_init:50
Aug  7 00:27:28 work2 kernel: Memory allocation failed.
Aug  7 00:27:28 work2 kernel: Systemtap Error at _stp_register_ctl_channel:738
Aug  7 00:27:28 work2 kernel: Error creating systemtap control channel.
Aug  7 00:31:29 work2 dhclient[1149]: DHCPREQUEST on ens36 to 192.168.46.254
port 67 (xid=0x2772f551)
Aug  7 00:31:29 work2 dhclient[1149]: DHCPACK from 192.168.46.254
(xid=0x2772f551)
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6530] dhcp4
(ens36):   address 192.168.46.170
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4
(ens36):   plen 24 (255.255.255.0)
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4
(ens36):   lease time 1800
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4
(ens36):   nameserver '192.168.46.1'
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4
(ens36):   domain name 'localdomain'
Aug  7 00:31:29 work2 dbus-daemon[724]: [system] Activating via systemd:
service name='org.freedesktop.nm_dispatcher'
unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.5' (uid=0
pid=839 comm="/usr/sbin/NetworkManager --no-daemon "
label="system_u:system_r:NetworkManager_t:s0")
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4
(ens36): state changed bound -> bound
Aug  7 00:31:29 work2 dhclient[1149]: bound to 192.168.46.170 -- renewal in 683
seconds.
Aug  7 00:31:29 work2 systemd[1]: Starting Network Manager Script Dispatcher
Service...
Aug  7 00:31:29 work2 dbus-daemon[724]: [system] Successfully activated service
'org.freedesktop.nm_dispatcher'
Aug  7 00:31:29 work2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295
ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=NetworkManager-dispatcher comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:31:29 work2 systemd[1]: Started Network Manager Script Dispatcher
Service.
Aug  7 00:31:29 work2 nm-dispatcher[47712]: req:1 'dhcp4-change' [ens36]: new
request (6 scripts)
Aug  7 00:31:29 work2 nm-dispatcher[47712]: req:1 'dhcp4-change' [ens36]: start
running ordered scripts...
Aug  7 00:31:40 work2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295
ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=NetworkManager-dispatcher comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:33 work2 audit[59170]: ANOM_ABEND auid=1000 uid=0 gid=0 ses=3
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=59170
comm="test"
exe="/mnt/home/agentzh/git/stap-build/testsuite/artifacts/systemtap.base/pr18649/test"
sig=11 res=1
Aug  7 00:32:33 work2 kernel: test[59170]: segfault at 0 ip 0000000000000000 sp
00000000fff42e30 error 14 in test[8048000+1000]
Aug  7 00:32:33 work2 systemd[1]: Started Process Core Dump (PID 59186/UID 0).
Aug  7 00:32:33 work2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295
ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-coredump@5-59186-0 comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:34 work2 systemd-coredump[59191]: Process 59170 (test) of user 0
dumped core.#012#012Stack trace of thread 59170:#012#0  0x0000000000000000 n/a
(n/a)
Aug  7 00:32:34 work2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295
ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-coredump@5-59186-0 comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:45 work2 kernel: test[62368]: segfault at 0 ip 0000000000000000 sp
00000000ffb53270 error 14 in test[8048000+1000]
Aug  7 00:32:45 work2 audit[62368]: ANOM_ABEND auid=1000 uid=0 gid=0 ses=3
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=62368
comm="test"
exe="/mnt/home/agentzh/git/stap-build/testsuite/artifacts/systemtap.base/pr18649/test"
sig=11 res=1
Aug  7 00:32:45 work2 systemd[1]: Started Process Core Dump (PID 62441/UID 0).
Aug  7 00:32:45 work2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295
ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-coredump@6-62441-0 comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:46 work2 systemd-coredump[62445]: Process 62368 (test) of user 0
dumped core.#012#012Stack trace of thread 62368:#012#0  0x0000000000000000 n/a
(n/a)
Aug  7 00:32:46 work2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295
ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-coredump@6-62441-0 comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:51 work2 audit[64501]: ANOM_ABEND auid=1000 uid=0 gid=0 ses=3
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=64501
comm="test"
exe="/mnt/home/agentzh/git/stap-build/testsuite/artifacts/systemtap.base/uprobe_nd_params/test"
sig=11 res=1
Aug  7 00:32:51 work2 kernel: test[64501]: segfault at 0 ip 0000000000000000 sp
00000000ff877e2c error 14 in test[8048000+1000]
Aug  7 00:32:51 work2 systemd[1]: Started Process Core Dump (PID 64521/UID 0).
Aug  7 00:32:51 work2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295
ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-coredump@7-64521-0 comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:52 work2 systemd-coredump[64529]: Process 64501 (test) of user 0
dumped core.#012#012Stack trace of thread 64501:#012#0  0x0000000000000000 n/a
(n/a)
Aug  7 00:32:52 work2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295
ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-coredump@7-64521-0 comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:33:25 work2 stapio[65395]: WARNING:
Aug  7 00:33:25 work2 stapio[65395]:
Aug  7 00:33:25 work2 stapio[65395]: Number of errors: 0, skipped probes: 23
Aug  7 00:34:16 work2 kernel: stap_e0ad497a0eaf9de8589606c16f0b585_79159:
unknown parameter 'pt' ignored

And below is the last bunch of lines in the test run output.

Running target unix
Using /usr/share/dejagnu/baseboards/unix.exp as board description file for
target.
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /mnt/home/agentzh/git/systemtap/testsuite/config/unix.exp as
tool-and-target-specific interface file.
Test run by root on Tue Aug  7 00:35:53 2018
Native configuration is x86_64-pc-linux-gnu

        === systemtap tests ===

Running
/mnt/home/agentzh/git/systemtap/testsuite/systemtap.base/kallsyms_expand_symbol.exp
...
Schedule of variations:
    unix

Running target unix
Using /usr/share/dejagnu/baseboards/unix.exp as board description file for
target.
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /mnt/home/agentzh/git/systemtap/testsuite/config/unix.exp as
tool-and-target-specific interface file.
Running /mnt/home/agentzh/git/systemtap/testsuite/systemtap.base/perf.exp ...

        === systemtap Summary ===

# of expected passes        2
make[3]: Leaving directory '/mnt/home/agentzh/git/stap-build/testsuite'
if test -e artifacts/systemtap.base/sdt_varname ; then \
echo "Using existing test result for systemtap.base/sdt_varname.exp"; \
else \
mkdir -p artifacts/systemtap.base/sdt_varname ; \
MAKEFLAGS= make  check-DEJAGNU
RUNTESTFLAGS="--outdir=artifacts/systemtap.base/sdt_varname
systemtap.base/sdt_varname.exp  --tool_opts \'install \'" STAP_PARALLEL=yes; \
fi
make[3]: Entering directory '/mnt/home/agentzh/git/stap-build/testsuite'
srcdir='../../systemtap/testsuite'; export srcdir; \
EXPECT=expect; export EXPECT; \
if /bin/sh -c "env XDG_DATA_DIRS= SYSTEMTAP_SYNC=1 LANG=C
SYSTEMTAP_TESTREMOTES= SYSTEMTAP_TESTAPPS=
SYSTEMTAP_RUNTIME=/opt/stap/share/systemtap/runtime
SYSTEMTAP_TAPSET=/opt/stap/share/systemtap/tapset
LD_LIBRARY_PATH=/opt/stap/lib/systemtap:/opt/stap/lib/dyninst
CRASH_LIBDIR=/opt/stap/lib/systemtap PATH=/opt/stap/bin:$PATH
SYSTEMTAP_PATH=/opt/stap/bin SYSTEMTAP_INCLUDES=/opt/stap/include
PKGLIBDIR=/opt/stap/libexec/systemtap PYTHON=/bin/python2
PYEXECDIR=/opt/stap/lib64/python2.7/site-packages PYTHON3=/bin/python3
PY3EXECDIR=/opt/stap/lib64/python3.6/site-packages SYSCONFDIR=/opt/stap/etc
../../systemtap/testsuite/execrc runtest --version" > /dev/null 2>&1; then \
  exit_status=0; l='systemtap'; for tool in $l; do \
    if env XDG_DATA_DIRS= SYSTEMTAP_SYNC=1 LANG=C SYSTEMTAP_TESTREMOTES=
SYSTEMTAP_TESTAPPS= SYSTEMTAP_RUNTIME=/opt/stap/share/systemtap/runtime
SYSTEMTAP_TAPSET=/opt/stap/share/systemtap/tapset
LD_LIBRARY_PATH=/opt/stap/lib/systemtap:/opt/stap/lib/dyninst
CRASH_LIBDIR=/opt/stap/lib/systemtap PATH=/opt/stap/bin:$PATH
SYSTEMTAP_PATH=/opt/stap/bin SYSTEMTAP_INCLUDES=/opt/stap/include
PKGLIBDIR=/opt/stap/libexec/systemtap PYTHON=/bin/python2
PYEXECDIR=/opt/stap/lib64/python2.7/site-packages PYTHON3=/bin/python3
PY3EXECDIR=/opt/stap/lib64/python3.6/site-packages SYSCONFDIR=/opt/stap/etc
../../systemtap/testsuite/execrc runtest  --tool $tool --tool_opts \'\'
--srcdir /mnt/home/agentzh/git/systemtap/testsuite
--outdir=artifacts/systemtap.base/sdt_varname systemtap.base/sdt_varname.exp
--tool_opts \'install \'; \
    then :; else exit_status=1; fi; \
  done; \
else echo "WARNING: could not find 'env XDG_DATA_DIRS= SYSTEMTAP_SYNC=1 LANG=C
SYSTEMTAP_TESTREMOTES= SYSTEMTAP_TESTAPPS=
SYSTEMTAP_RUNTIME=/opt/stap/share/systemtap/runtime
SYSTEMTAP_TAPSET=/opt/stap/share/systemtap/tapset
LD_LIBRARY_PATH=/opt/stap/lib/systemtap:/opt/stap/lib/dyninst
CRASH_LIBDIR=/opt/stap/lib/systemtap PATH=/opt/stap/bin:$PATH
SYSTEMTAP_PATH=/opt/stap/bin SYSTEMTAP_INCLUDES=/opt/stap/include
PKGLIBDIR=/opt/stap/libexec/systemtap PYTHON=/bin/python2
PYEXECDIR=/opt/stap/lib64/python2.7/site-packages PYTHON3=/bin/python3
PY3EXECDIR=/opt/stap/lib64/python3.6/site-packages SYSCONFDIR=/opt/stap/etc
../../systemtap/testsuite/execrc runtest'" 1>&2; :;\
fi; \
exit $exit_status
WARNING: Couldn't find the global config file.
Checking for 64-bit support...
spawn -ignore SIGHUP gcc hello.c -g -m64 -lm -o hello-m64^M
x86_64 64-bit support available
Checking for 32-bit support...
spawn -ignore SIGHUP gcc hello.c -g -m32 -lm -o hello-m32^M
x86_64 32-bit support available
kernel location: /usr/lib/debug/lib/modules/4.16.16-200.fc27.x86_64/vmlinux
kernel version: 4.16.16-200.fc27.x86_64
systemtap location: /opt/stap/bin/stap
systemtap version: version 3.3/0.173, commit release-3.3-0-g48867d1cface
gcc location: /usr/bin/gcc
gcc version: gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)
Test run by root on Tue Aug  7 00:35:57 2018
Native configuration is x86_64-pc-linux-gnu

        === systemtap tests ===

Schedule of variations:
    unix

Running target unix
Using /usr/share/dejagnu/baseboards/unix.exp as board description file for
target.
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /mnt/home/agentzh/git/systemtap/testsuite/config/unix.exp as
tool-and-target-specific interface file.
Running
/mnt/home/agentzh/git/systemtap/testsuite/systemtap.base/sdt_varname.exp ...

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #1 from agentzh <agentzh at gmail dot com> ---
Sorry, the kernel version is 4.16.16:

$ uname -r
4.16.16-200.fc27.x86_64

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #2 from agentzh <agentzh at gmail dot com> ---
Created attachment 11167
  --> https://sourceware.org/bugzilla/attachment.cgi?id=11167&action=edit
stap-report output on the same box

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 2.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

agentzh <agentzh at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Test suite makes all CPU    |Test suite makes all CPU
                   |stuck forever on kernel     |stuck forever on kernel
                   |2.6.16 (Fedora 27)          |2.16.16 (Fedora 27)

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

agentzh <agentzh at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Test suite makes all CPU    |Test suite makes all CPU
                   |stuck forever on kernel     |stuck forever on kernel
                   |2.16.16 (Fedora 27)         |4.16.16 (Fedora 27)

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

Frank Ch. Eigler <fche at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fche at redhat dot com

--- Comment #3 from Frank Ch. Eigler <fche at redhat dot com> ---
Two more bits of data should help identify the problem:

- a segment of the dmesg from -before- the crash, in order to figure out which
scripts were running most recently or at the time

- the exact make-check type invocation you were using (a
parallel-installcheck?)

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #4 from agentzh <agentzh at gmail dot com> ---
Created attachment 11175
  --> https://sourceware.org/bugzilla/attachment.cgi?id=11175&action=edit
Full /var/log/messages for the full period of test running

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #5 from agentzh <agentzh at gmail dot com> ---
Created attachment 11176
  --> https://sourceware.org/bugzilla/attachment.cgi?id=11176&action=edit
Full output of the command `sudo time make -j8 installcheck-parallel |& tee
result.txt`

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #6 from agentzh <agentzh at gmail dot com> ---
The command I used to run the stap test suite was this:

```
sudo time make -j8 installcheck-parallel |& tee result.txt
```

I've attached the full `result.txt` produced that night to this PR for your
reference.

I've also attached the full /var/log/messages file content during the test run
period (also included messages right before I started running the stap test
suite and right after I rebooted the machine on the next day to avoid any
confusions).

Some observations from the result.txt and /var/log/messages files:

1. Aug 6 23:18 I started running the test with the command `sudo time make -j8
installcheck-parallel |& tee result.txt` (see result.txt)

2. Aug 7 00:35:57 or later: the stap test suite generates the last few lines of
output to stderr/stdout (see result.txt)

3. Aug  7 00:34:16 The kernel generated the last line of output to
/var/log/messages, which was "work2 kernel:
stap_e0ad497a0eaf9de8589606c16f0b585_79159: unknown parameter 'pt' ignored"
After that, no kernel dmesg output until I rebooted the machine with force.

4. Aug  7 08:28:03 I got up in the morning and rebooted the machine with force.

The virtual machine running the tests has 8GB of RAM:

```
$ free
              total        used        free      shared  buff/cache   available
Mem:        8143832      646244     1851164       98824     5646424     6997500
Swap:       2097148           0     2097148
```

It has 8 logical CPU cores (4 physical CPU cores with hyperthreading enabled).

Please let me know if you need any further information. Thanks!

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #7 from agentzh <agentzh at gmail dot com> ---
It seems like that the stuck usually happens after running
`uprobes_onthefly.exp` and/or `kprobes_onthefly.exp`. By checking the
`/proc/buddyinfo` output during the nearby time window, the fragmentation
becomes dramatically worse, for example:

```
$ cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      2      1      1      0  
   1      1      3
Node 0, zone    DMA32     14     23     27     27     23     12     17     16  
   9      4    739
Node 0, zone   Normal    379   4113   3506   3803   2461    733    221      0  
   0      0      0

$ cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      2      1      1      0  
   1      1      3
Node 0, zone    DMA32      5      5     10     20     15      5     16     16  
   9      4    700
Node 0, zone   Normal    344   1273   1414   1396   1047    473    143      0  
   0      0      0
```

Even though there's a lot of free memory in the system at the same times:

```
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          11700        4897        3510          76        3293        5702
Swap:          2047           0        2047
```

I'm thinking about enforcing VM compaction every few minutes during the test
suite runs as a work around...

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #8 from agentzh <agentzh at gmail dot com> ---
Got another VM lockup today and found the following output in the serial port
console:

```
Fedora 27 (Server Edition)
Kernel 4.16.16-200.fc27.x86_64 on an x86_64 (ttyS1)

Admin Console: …

work2 login: [199810.093706] BUG: unable to handle kernel NULL pointer
dereference at 0000000000000018
[199810.098304] IP: idr_find+0x0/0x10
[199810.099452] PGD 0 P4D 0
[199810.100545] Oops: 0000 [#1] SMP PTI
[199810.101868] Modules linked in:
stap_90d3fbe18ef964786450622403aff75f_53232(OE)
stap_cb1c82c1a520c7e28bc23ff0583b723_26015(OE)
stap_5546859fb0fb1fe0358665f08f8539_102168(OE)
stap_1699a9c06989cded3b2bfdebd62fd6e9_7693(OE) s_97966(OE) tun binfmt_misc
ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink
ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 fuse nf_nat_ipv4 nf_nat
nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter
ebtables ip6table_filter ip6_tables vmw_vsock_vmci_transport vsock sunrpc
snd_seq_midi snd_seq_midi_event snd_ens1371 coretemp crct10dif_pclmul
snd_ac97_codec ac97_bus btusb gameport crc32_pclmul btrtl
[199810.124418]  snd_rawmidi btbcm ghash_clmulni_intel snd_seq btintel
snd_seq_device bluetooth snd_pcm snd_timer snd ecdh_generic rfkill soundcore
intel_rapl_perf joydev vmw_balloon i2c_piix4 vmw_vmci shpchp xfs libcrc32c
vmwgfx drm_kms_helper ttm crc32c_intel drm mptspi serio_raw scsi_transport_spi
mptscsih e1000 mptbase ata_generic pata_acpi [last unloaded:
stap_c0bb131f68fd6d96fe20bf726877cbfd_51549]
[199810.135177] CPU: 2 PID: 53244 Comm: cat Tainted: G           OE  
4.16.16-200.fc27.x86_64 #1
[199810.137821] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
Desktop Reference Platform, BIOS 6.00 07/02/2015
[199810.141336] RIP: 0010:idr_find+0x0/0x10
[199810.142583] RSP: 0018:ffffbeddc3423d40 EFLAGS: 00010046
[199810.144310] RAX: 0000000000000000 RBX: ffffbeddc2191000 RCX:
00000000ffffffff
[199810.146450] RDX: 0000000000000000 RSI: 000000000000cffc RDI:
0000000000000008
[199810.148760] RBP: ffffbeddc21918a8 R08: 0000000000002710 R09:
ffff9e8127a01f00
[199810.150956] R10: 0000000000002710 R11: 0000000000000000 R12:
0000000000000001
[199810.153205] R13: 0000000000000000 R14: 0000000000000000 R15:
00000000ffffffff
[199810.155436] FS:  0000000000000000(0000) GS:ffff9e825fa80000(0000)
knlGS:0000000000000000
[199810.157981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[199810.212544] CR2: 0000000000000018 CR3: 000000024120a001 CR4:
00000000001606e0
[199810.215011] DR0: ffffffff8427ff88 DR1: 0000000000000000 DR2:
0000000000000000
[199810.217433] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000600
[199810.219853] Call Trace:
[199810.233902]  find_get_pid+0x38/0x50
[199810.235436]  function___global_pid2task__overload_0+0x7a/0x180
[stap_90d3fbe18ef964786450622403aff75f_53232]
[199810.239019]  ? probe_4960+0xad6/0x1760
[stap_90d3fbe18ef964786450622403aff75f_53232]
[199810.241607]  ? enter_real_tracepoint_probe_0+0x198/0x2d0
[stap_90d3fbe18ef964786450622403aff75f_53232]
[199810.245162]  ? __schedule+0x2d3/0x860
[199810.246718]  ? do_task_dead+0x3f/0x50
[199810.248254]  ? do_exit+0x73b/0xbb0
[199810.249684]  ? do_group_exit+0x3a/0xa0
[199810.251200]  ? SyS_exit_group+0x10/0x10
[199810.252711]  ? do_syscall_64+0x74/0x180
[199810.254238]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[199810.256126] Code: e7 44 89 6c 24 04 e8 00 fe ff ff eb af e8 09 b9 80 ff 66
0f 1f 84 00 00 00 00 00 8b 47 10 31 d2 48 29 c6 e9 c3 55 00 00 0f 1f 00 <8b> 47
10 48 29 c6 e9 a5 55 00 00 0f 1f 44 00 00 55 53 48 89 f5
[199810.262366] RIP: idr_find+0x0/0x10 RSP: ffffbeddc3423d40
[199810.264261] CR2: 0000000000000018
[199810.265642] ---[ end trace 4e992f315e802219 ]---
[199810.267325] Fixing recursive fault but reboot is needed!
```

Is it a bug in the stap runtime or in the kernel?

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #10 from agentzh <agentzh at gmail dot com> ---
(In reply to Mark Wielaard from comment #9)
>
> Do you have more context about which probe was hit when this happened?

Hi Mark, thanks for your reply. I tried hard to pinpoint that
stap_90d3fbe18ef964786450622403aff75f_53232 kernel module after rebooting the
system with force but it seems that this ko is long gone and there is no record
for this thing in the test suite log files either. Anyway, I'll provide the
full test suite output file in an attachment here.

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #11 from agentzh <agentzh at gmail dot com> ---
Created attachment 11215
  --> https://sourceware.org/bugzilla/attachment.cgi?id=11215&action=edit
Full output of 'sudo make -j4 installcheck-parallel' for the find_get_pid
kernel oops run

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

Mark Wielaard <mark at klomp dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mark at klomp dot org

--- Comment #9 from Mark Wielaard <mark at klomp dot org> ---
(In reply to agentzh from comment #8)
> Is it a bug in the stap runtime or in the kernel?

It looks like this is caused by calling the kernel find_get_pid () function
from the pid2task tapset function when in kernel mode, just after the current
task died? I am not sure find_get_pid should/can be called from a kernel
context. It might assume the current pid actually exists? Maybe current is NULL
at this point and we should check for that first?

Do you have more context about which probe was hit when this happened?

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #12 from Frank Ch. Eigler <fche at redhat dot com> ---
(In reply to Mark Wielaard from comment #9)
> It might assume the current pid actually exists? Maybe
> current is NULL at this point and we should check for that first?

current is nonzero (I think by definition), but
task_active_pid_ns(current) (called within find_get_pid->find_vpid)
looks like it was 0.

So when we get called for a schedule tracepoint, the invoking task
might just be a mostly-dead one which doesn't even have a pid any more.
I'm not sure why this should occur only under heavy load
(installcheck-parallel), vs. all the time.  Maybe task garbage collection
occurs more in the former case.

I'm tempted to put in a probe-prologue detection of this
(maybe via (current->flags & PF_EXITING)) and reject the probe
hit entirely.

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug testsuite/23493] Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23493

--- Comment #13 from agentzh <agentzh at gmail dot com> ---
fche, thanks for the suggestion! We will try doing such experiments on our side
when we have a chance.

--
You are receiving this mail because:
You are the assignee for the bug.