[Bug runtime/26131] New: Relay flushing results in garbled output data being read by staprun

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug runtime/26131] New: Relay flushing results in garbled output data being read by staprun

Sourceware - systemtap mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=26131

            Bug ID: 26131
           Summary: Relay flushing results in garbled output data being
                    read by staprun
           Product: systemtap
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: agentzh at gmail dot com
  Target Milestone: ---

We observed some garbled data ('\0') would randomly appear in staprun's stdout
output or `-o FILE` option's file output when the machine is under load. It can
be demonstrated by the following simple stap script:

https://gist.github.com/agentzh/b44384ad9f3804d85ddfbf771094a718

When running this script's ko module with staprun in 20 ~ 30 threads, we can
get a sample output with `\0` in it every 30 sec to 3 min. One bad sample
output file is like this (generated by `staprun -o FILE`):

https://gist.github.com/agentzh/5967e0a9ce90e494ee0c88cb4d9ecef6

It is reproducible with a very recent systemtap git master (commit fb59e8c75).

The `staprun -vvv` output for the bad runs is all similar to the following:

https://gist.github.com/agentzh/283a15f617f751cf9faa786632b1e516

We can see that the staprun receives `STP_REQUEST_EXIT` and sends back a
`STP_STOP` command back to ko before its reader thread has read anything from
the relay data debugfs file. The `reader_thread:237` log lines are added by
myself.

Digging deeper I found that there is a race condition in kernel's
`relay_switch_subbuf` function which is called by `relay_flush()` in systemtap
runtime's `_stp_transport_data_fs_stop` function. To be more specific, it
happens on the following line of the kernel's `relay_switch_subbuf`:

```
irq_work_queue(&buf->wakeup_work);
```

When this line is being executed, the `relay_file_read()` function reads the
relay buf data structures which is partially set by the `relay_switch_subbuf`
function. The latter needs to set the following fields in the `struct
rchan_buf` data structure:

* buf->offset
* buf->padding[new_subbuf]
* buf->subbufs_produced
* buf->data

But at the point of the `irq_work_queue()` call, only `buf->subbufs_produced`
is assigned, leaving `buf` in an inconsistent state, confusing
`relay_file_read()` to read the full newly switched empty buffer (which is
usually full of \0 bytes), as evidenced by the following dmesg messages
produced by my own `printk` calls in the systemtap runtime:

```
[203515.224386] return_30253: _stp_transport_data_fs_stop:334
[203515.224393] relay before flush: subbuf_size: 262144, n_subbufs: 8, chan
global? 1, prev padding: 0, subbufs produced: 0, subbufs consumed: 0, bytes
consumed: 21, offset: 21, paddings: 0 0, early bytes: 0
[203515.224531] return_30253: __stp_relay_switch_subbuf:106
[203515.224534] XXX line 106: subbufs_consumed changed! offset: 21, subbufs
consumed; 1
[203515.224537] return_30253: __stp_relay_subbuf_start_callback:225
[203515.224540] subbuf start returned 1: overwrite: 0, relay buf full: 0
[203515.224542] return_30253: _stp_transport_data_fs_stop:350
[203515.224547] relay after flush: subbufs produced: 1, subbufs consumed: 1,
bytes consumed: 0, offset: 0, paddings: 262123 0, early bytes: 0, buf ==
per_cpu? 1
```

Using the numbers in this dmesg output, we can successfully determine the
following execution sequence on the relay reader side (kernel space only):

```
relay_file_read
  relay_file_read_avail
    relay_file_read_consume
      relay_subbufs_consumed(buf->chan, buf->cpu, 1)
        buf->subbufs_consumed += subbufs_consumed;  // subbufs_consumed == 1
      buf->bytes_consumed = 0;
    return 1;
  read_start = relay_file_read_start_pos(*ppos, buf)  // *ppos == 0
    read_pos = 262144; // subbuf_size
    return read_pos;
  // read_start = 262144;
  avail = relay_file_read_subbuf_avail(read_start, buf);  // avail == 262144
```

where 262144 is the default subbuf size set by the stap runtime (65536*4).

After switching the `relay_flush()` call from the kernel version to systemtap's
own version based on systemtap's own `__stp_relay_switch_subbuf()` can mitigate
the problem to the extend that it's very hard to reproduce the race with the
simple .stp script given above. This is because systemtap's
`__stp_relay_switch_subbuf()` intentionally avoids the `irq_work_queue()` call.
But still, we can get garbled output using .stp scripts producing more output
data, like this one:

```
global N = 20000

probe oneshot {
  exit()
}

probe end {
  for (i = 0; i < N; i++)
      printf("hello world %d !!! %d\n", i, i + 1)
}
```

This script generates 130895 bytes of data (about 128KB). This is due to the
fact that both implementations of `relay_switch_subbuf` has inherent race
conditions with the concurrent reader. There is no locking protection around
the statements setting those `buf->xxx` fields. So there's always a chance that
the reader would interrupt in the middle of that instruction sequence. The
right way would be introduce a spinlock around the critical section of
`relay_switch_subbuf` and also make the reader respect that.

The simplest patch would be remove the `relay_flush()` call from the
`_stp_transport_data_fs_stop` function. If no subbuf switching, then no races
at all. That flushing call is not necessary anyway because `relay_close()`
would defer the deallocation of the relay bufs until their reference counting
reaches zero. And load testing using the script above reveals no data loss in
the end of the output stream. This is not a complete fix even for systemtap
alone, however, because there are other places in the systemtap runtime which
would call `__stp_relay_switch_subbuf()`, like in function
`_stp_data_write_reserve`. A second patch I would suggest is to reorder the
statements of the `buf` field assignments to minimize the race window even
though completely eliminating it would not be possible if not adding a lock
around the critical sections. Anyway, these two patches do not require kernel
patching and I'll prepare them shortly for review unless there are better
ideas.

As far as I can see, this racing bug exists in both older kernels like CentOS
7's 3.10 kernels, and also newer kernels found in Fedora (or the kernel's git
master).

Feedback welcome! Thanks!

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug runtime/26131] Relay flushing results in garbled output data being read by staprun

Sourceware - systemtap mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=26131

--- Comment #1 from agentzh <agentzh at gmail dot com> ---
OK, we should keep the relay_flush(). It is actually systemtap runtime's fault
not using the inode locks to protect against race conditions between readers
and writers for the relay stuff.

--
You are receiving this mail because:
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug runtime/26131] Relay flushing results in garbled output data being read by staprun

Sourceware - systemtap mailing list
In reply to this post by Sourceware - systemtap mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=26131

agentzh <agentzh at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #2 from agentzh <agentzh at gmail dot com> ---
Fixed in commit 469d20e8b and pushed to master.

--
You are receiving this mail because:
You are the assignee for the bug.