[Bug tapsets/25290] New: process(EXE).begin may occasionally miss already-running target processes from EXE

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug tapsets/25290] New: process(EXE).begin may occasionally miss already-running target processes from EXE

fweimer at redhat dot com
https://sourceware.org/bugzilla/show_bug.cgi?id=25290

            Bug ID: 25290
           Summary: process(EXE).begin may occasionally miss
                    already-running target processes from EXE
           Product: systemtap
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: tapsets
          Assignee: systemtap at sourceware dot org
          Reporter: agentzh at gmail dot com
  Target Milestone: ---

I've noted that the process(EXE).begin probes may sometimes miss target
processes which are already running when staprun is started. To reproduce this:

1. Prepare a minimal a.c C program file:

```
#include <unistd.h>

int main (int argc, char** argv) {
    while (1) {
        usleep(1);
    }
    return 0;
}
```

Compile it like this:

    gcc -g a.c

which results in ./a.out generated. Run this program in a separate terminal
window like this:


    ./a.out

Keep this process running (it has an infinite loop so it will never exit by
itself).

2. Prepare a minimal a.stp script file:

```
global max_cnt = 5, cnt = 0;
global pids

probe process("/home/agentzh/a.out").begin {
    pids[pid()] = 1
}

probe process("/home/agentzh/a.out").end {
    delete pids[pid()];
}

probe process("/lib64/libc.so.6").function("usleep") {
    if (!pids[pid()])
        next;

    if (++cnt > max_cnt)
        exit();

    println("usleep(", @var("useconds"), ")");
}

probe timer.s(5) {
    warn("timer expired");
    exit();
}

probe begin {
    warn("Start tracing...")
}
```

And then compile this file to a kernel module file, usleep.ko:


    stap -p4 -m usleep a.stp


3. Run this usleep.ko module in a shell loop:


    ( while true; do echo ======; sudo staprun usleep.ko; sleep 0.1; done ) |&
tee a.txt

after a few minutes (be patient!), we should see "timer expired" messages from
the output file a.txt:

    $ grep -a timer a.txt
    WARNING: timer expired
    WARNING: timer expired
    WARNING: timer expired
    WARNING: timer expired
    WARNING: timer expired
    WARNING: timer expired
    WARNING: timer expired
    WARNING: timer expired

You may get fewer lines here though.

The timer expired message should never be printed when the process(EXE).begin
probe is fired properly.

I tried the latest master branch of stap on kernel 5.0.16-100.fc28.x86_64 from
Fedora 28, kernel 4.15.0-72-generic x86_64 from Ubuntu 18.04, and kernels
3.10.0-957.27.2.el7.x86_64 and 3.10.0-1062.9.1.el7.x86_64 from CentOS 7, all
show the same problem. I just noted that this is much harder to reproduce on
kernel 5.0.16 than earlier versions of the kernel.

The stap version:

```
$ /opt/stap/bin/stap -V
Systemtap translator/driver (version 4.3/0.174/0.177, commit
release-4.2-10-g1427836ac118)
Copyright (C) 2005-2019 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
tested kernel versions: 2.6.32 ... 5.4-rc6
enabled features: BPF LIBSQLITE3 NLS
```

Any hints on debugging this further? Thanks!

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug tapsets/25290] process(EXE).begin may occasionally miss already-running target processes from EXE

fweimer at redhat dot com
https://sourceware.org/bugzilla/show_bug.cgi?id=25290

--- Comment #1 from agentzh <agentzh at gmail dot com> ---
OK, this can be reproduced by a much simpler stap script (a simple one-liner):

    $ stap -m test -p4 -e 'probe process("/home/agentzh/a.out").begin {
println("Hit"); exit() } probe begin { println("Start tracing") }'

    $ while true; do sudo staprun test.ko; sleep 0.1; done

The second command would hang forever upon a "Start tracing" output line (when
./a.out is always running in another terminal).

And /home/agentzh/a.out is generated from exactly the same a.c source file
provided in my previous comment.

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug tapsets/25290] process(EXE).begin may occasionally miss already-running target processes from EXE

fweimer at redhat dot com
In reply to this post by fweimer at redhat dot com
https://sourceware.org/bugzilla/show_bug.cgi?id=25290

--- Comment #2 from agentzh <agentzh at gmail dot com> ---
Okay, with fche's guidance on IRC, I've narrowed it down to a race condition
that the utrace process.begin callback would be invoked when the stap sessions
state is still STAP_SESSION_STARTING instead of STAP_SESSION_RUNNING
occasionally.

As per fche's suggestion on IRC, I'd propose the following patch to fix this:

https://pastebin.com/WumwHdAT

I've tested this patch myself on 3.10 kernels from CentOS, 4.15 kernels from
Ubuntu and 5.0 kernels from Fedora. And it's quite stable now. Also tested the
process(EXE).end and process(EXE).thread.{begin,end} probes. All look fine now.

Feedback welcome! Thanks!

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug tapsets/25290] process(EXE).begin may occasionally miss already-running target processes from EXE

fweimer at redhat dot com
In reply to this post by fweimer at redhat dot com
https://sourceware.org/bugzilla/show_bug.cgi?id=25290

--- Comment #3 from agentzh <agentzh at gmail dot com> ---
Committed a modified version of the patch as 04d221c16.

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug tapsets/25290] process(EXE).begin may occasionally miss already-running target processes from EXE

fweimer at redhat dot com
In reply to this post by fweimer at redhat dot com
https://sourceware.org/bugzilla/show_bug.cgi?id=25290

agentzh <agentzh at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #4 from agentzh <agentzh at gmail dot com> ---
Consider it resolved.

--
You are receiving this mail because:
You are the assignee for the bug.