Re: [Bug translator/1276] support more timer varieties

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [Bug translator/1276] support more timer varieties

Frank Ch. Eigler
Hi -

> [...]  timer.profile is always in interrupt, so current->tgid is not
> accessed.  I tried this without the in_interrupt check, and the pids
> I got looked fine.  [...]  The reason you might NOT want to read
> from current in an interrupt is that it may be misleading to the
> user if they're not expecting this.

Is that the only reason?  I was under the impression that it is also
possible for utterly fictional "current" values to be left in place,
which could cause problems if dereferenced.

> [...]
> I think the best solution would be if we had some sort of global flag
> that indicated that we're in a profiling interrupt.  That way we could
> have something like this:
>   function pid:long () %{
>       if (unlikely(in_interrupt() && !_stp_in_profiling_interrupt))
>           THIS->__retvalue = 0;
>       else
>           THIS->__retvalue = current->tgid;
>   %}

Or a "I believe `current' is safe to dereference here." flag.
I am rather embarrassed to suggest this in the interim:

#      if (unlikely(in_interrupt() &&
#                   strcmp(CONTEXT->probe_point, "timer.profile")))

- FChE
Reply | Threaded
Open this post in threaded view
|

RE: [Bug translator/1276] support more timer varieties

Stone, Joshua I
Frank Ch. Eigler wrote:

> Hi -
>
>> [...]  timer.profile is always in interrupt, so current->tgid is not
>> accessed.  I tried this without the in_interrupt check, and the pids
>> I got looked fine.  [...]  The reason you might NOT want to read
>> from current in an interrupt is that it may be misleading to the
>> user if they're not expecting this.
>
> Is that the only reason?  I was under the impression that it is also
> possible for utterly fictional "current" values to be left in place,
> which could cause problems if dereferenced.

AFAIK, current is always valid.  I know in the Vtune driver we
frequently use current during PMU interrupts, and there's never been an
issue.  Not to say that Vtune is difinitive - I scanned the kernel
sources as well, and didn't see anything making checks before accessing
current.  I also didn't find anything through Google.  So unless someone
knows otherwise, or has a reference to cast doubt on this, I think it's
safe.

The "let's not mislead the user" issue may be minor enough that we just
say who cares...

> Or a "I believe `current' is safe to dereference here." flag.

If we determine that there are specific unsafe instances of current,
then yes, this would be nice.

> I am rather embarrassed to suggest this in the interim:
>
> #      if (unlikely(in_interrupt() &&
> #                   strcmp(CONTEXT->probe_point, "timer.profile")))

Hmm... I think we can at least do:
#      if (unlikely(in_interrupt() &&
#                   (CONTEXT->probe_point != "timer.profile")))

Since they are both literal strings, string-pooling should make them
identical pointers.  We could whitelist probes this way, but...  yuck...

I hope we can get a better answer to the validity of "current".

Josh
Reply | Threaded
Open this post in threaded view
|

Re: [Bug translator/1276] support more timer varieties

Frank Ch. Eigler
Hi -

Josh wrote:

> AFAIK, current is always valid.  [...]  So unless someone knows
> otherwise, or has a reference to cast doubt on this, I think it's
> safe. [...]

I guess the "kernel-hacking" kernel/Documentation/DocBook file says
just this:

  <sect1 id="routines-current">
   <title><function>current</function>
    <filename class="headerfile">include/asm/current.h</filename></title>

   <para>
    This global variable (really a macro) contains a pointer to
    the current task structure, so is only valid in user context.
    For example, when a process makes a system call, this will
    point to the task structure of the calling process.  It is
    <emphasis>not NULL</emphasis> in interrupt context.
   </para>
  </sect1>

If it doesn't scare you off, then let's have a volunteer try disarming
some of those in_interrupt() conditionals (maybe replacing them with a
quick NULL or other simple pointer validity check), and write a few
stress tests (probes in uncomfortable spots for "current" usage).

- FChE
Reply | Threaded
Open this post in threaded view
|

Re: [Bug translator/1276] support more timer varieties

Roland McGrath
What this really means is that "current" always works, it's just not always
useful to think about that task.  i.e., in an interrupt handler, the
interrupt has nothing necessarily to do with the task that was current at
the time of the interrupt.  I'm fairly sure that "current" never yields
garbage, or even a task that isn't exactly the one that was on the CPU
running process-mode code last.
Reply | Threaded
Open this post in threaded view
|

RE: [Bug translator/1276] support more timer varieties

Stone, Joshua I
In reply to this post by Frank Ch. Eigler
Roland McGrath wrote:
> What this really means is that "current" always works, it's just not
> always useful to think about that task.  i.e., in an interrupt
> handler, the interrupt has nothing necessarily to do with the task
> that was current at the time of the interrupt.  I'm fairly sure that
> "current" never yields garbage, or even a task that isn't exactly the
> one that was on the CPU running process-mode code last.

That is my understanding as well... the pointer is always valid, it just
doesn't pertain to the interrupt itself.


Frank Ch. Eigler wrote:
> If it doesn't scare you off, then let's have a volunteer try disarming
> some of those in_interrupt() conditionals (maybe replacing them with a
> quick NULL or other simple pointer validity check), and write a few
> stress tests (probes in uncomfortable spots for "current" usage).

I'll volunteer for this (though others are welcome as well).  Have any
suggested probe points that are particularly uncomfortable?


Josh
Reply | Threaded
Open this post in threaded view
|

Re: [Bug translator/1276] support more timer varieties

Frank Ch. Eigler
Hi -

> > [...] write a few stress tests (probes in uncomfortable spots for
> > "current" usage).
>
> [...]  Have any suggested probe points that are particularly
> uncomfortable?

Places that come to mind are those routines that change the "current"
pointers, and routines (if any) involved in computing the "current"
macro, to test for reentrancy problems.

- FChE
Reply | Threaded
Open this post in threaded view
|

Re: [Bug translator/1276] support more timer varieties

Roland McGrath
> Places that come to mind are those routines that change the "current"
> pointers, and routines (if any) involved in computing the "current"
> macro, to test for reentrancy problems.

The macro is inlined into a little bit of assembly.  The pointer is stored
on the stack and never changed (you just switch stacks).  The only current
pointer that gets changed is the one on interrupt stacks, which is set up
before switching to the interrupt stack.  So the closest to "dangerous"
places would be inside context switching and inside interrupt handling.
Reply | Threaded
Open this post in threaded view
|

RE: [Bug translator/1276] support more timer varieties

Stone, Joshua I
In reply to this post by Frank Ch. Eigler
Roland McGrath wrote:
> the closest to "dangerous" places would be inside context switching
> and inside interrupt handling.

In an effort to stress as many of these cases as I could think of, I
wrote the script included below.  I was unable to get this to fail, even
under heavy system load.  I did my testing on RHEL4U2 x86_64, on a P4
w/HT enabled.  My stress test was a full kernel build with 'make -j20',
while also interacting with the desktop (login/out of gdm, load
openoffice & firefox, etc.).

When I was writing the test script, I did find a couple of probes that
failed, but it was unrelated to "current" dereferencing.  I will follow
up in another email...

If anyone feels this is not representative, please let me know how I can
improve the test and/or load.

It may be a good idea to add this sort of test to the pass-5 suite.
Perhaps a new directory is needed - systemtap.stress?

Thanks,

Josh


current.stp
===================================================================
#!/usr/bin/stap -g

global length
function execlen:long () %{ THIS->__retvalue = strlen(current->comm); %}
probe begin { log("systemtap starting probe") }
probe
    timer.profile,
    kernel.function("__switch_to"),
    %( kernel_vr == "2.6.9-22.ELsmp" %?
       %( arch == "x86_64" %?
          /* the lines before, at, and after the update of pcurrent */
 
kernel.statement("__switch_to@arch/x86_64/kernel/process.c:508"),
 
kernel.statement("__switch_to@arch/x86_64/kernel/process.c:509"),
 
kernel.statement("__switch_to@arch/x86_64/kernel/process.c:510"),
       %)
    %)
    kernel.function("*@kernel/sched.c"),
    kernel.function("*@kernel/sched.c").return,
    module("*").function("*interrupt*"),
    module("*").function("*interrupt*").return
{ length <<< execlen() }
probe end {
    log("systemtap ending probe")
    printf("count=%d\n", @count(length))
    printf("sum=%d\n", @sum(length))
    printf("min=%d\n", @min(length))
    printf("max=%d\n", @max(length))
    printf("avg=%d\n", @avg(length))

    /*
     * Check that the min & max lengths look reasonable.  If any string
was
     * either empty or too big, then the current pointer probably wasn't
     * valid, even though it dereferenced without crashing.
     */
    if (@min(length) > 0) {
        log("systemtap test success")
    } else {
        log("unexpected minimum length")
        log("systemtap test failure")
    }
    if (@max(length) < %( kernel_v >= "2.6.11" %? TASK_COMM_LEN %: 16
%)) {
        log("systemtap test success")
    } else {
        log("unexpected maximum length")
        log("systemtap test failure")
    }
}
Reply | Threaded
Open this post in threaded view
|

RE: [Bug translator/1276] support more timer varieties

Stone, Joshua I
In reply to this post by Frank Ch. Eigler
Stone, Joshua I wrote:
> When I was writing the test script, I did find a couple of probes that
> failed, but it was unrelated to "current" dereferencing.  I will
> follow up in another email...

As promised, here they are.  I tried these with very simple probe bodies
(increment a global number), so I know the crashes have nothing to do
with "current" dereferencing.

The first one that failed on me was 'kernel.inline("get_current")'.  I
figured this would be a dangerous one, but I wanted to try it.  This
matched 2710 locations for me, and probably a lot of them are in
critical locations.  When I ran this, I got "NMI Watchdog detected
LOCKUP".  It probably isn't worth trying to make this one work, but if
we have a blacklist, this should be on it.

The second one that failed was 'kernel.function("__switch_to").return'.
This one is a problem with kretprobes only, as all of my other probes in
__switch_to behaved just fine, even in the middle of the function.
Running this gave "Kernel BUG at kprobes:449" (the full dump is included
below).  The line mentioned is in trampoline_probe_handler:

    BUG_ON(!orig_ret_address || (orig_ret_address ==
trampoline_address));

It seems pretty obvious that the actions taken in __switch_to would
conflict with the way the trampoline works.  I don't know if it's
possible to make kretprobes work on this function - if not, it should be
blacklisted.


Josh


crashdump from kernel.function("__switch_to").return
===================================================================
Kernel BUG at kprobes:449
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: stap_4037(U) nfsd exportfs lockd md5 ipv6
parport_pc lp parport autofs4 i2c_dev i2c_core smbfs sunrpc ds
yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack
iptable_filter ip_tables dm_mirror dm_mod button battery ac
joydev uhci_hcd ehci_hcd shpchp hw_random ata_piix snd_azx
snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc sk98lin(U) floppy ext3 jbd ahci libata
sd_mod scsi_mod
Pid: 4006, comm: sshd Tainted: GF     2.6.9-22.ELsmp
RIP: 0010:[<ffffffff80121809>]
<ffffffff80121809>{trampoline_probe_handler+130}
RSP: 0018:0000010071d81b28  EFLAGS: 00010002
RAX: 0000000000000001 RBX: 000001007f9d4600 RCX: ffffffffa0000000
RDX: 0000000000000001 RSI: 0000010071d81bf8 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000010037e14000 R09: 0000000000000003
R10: 00000000000000ff R11: 0000000000000003 R12: 0000000000000000
R13: 0000010071d81bf8 R14: 0000000000000000 R15: 0000010002c1a5e0
FS:  0000002a96a33280(0000) GS:ffffffff804d3180(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000383c38f070 CR3: 0000000037e18000 CR4: 00000000000006e0
Process sshd (pid: 4006, threadinfo 0000010071d80000, task
0000010076dc3030)
Stack: ffffffff803d16a0 ffffffff803d16a0 ffffffff80121785
0000010071d81bf8
       ffffffff80121786 ffffffff80121751 ffffffff803db140
0000010071d81bb8
       0000000000000002 0000010076dc3030
Call Trace:
       <ffffffff80121785>{kretprobe_trampoline+0}
       <ffffffff80121786>{kretprobe_trampoline+1}
       <ffffffff80121751>{kprobe_handler+559}
       <ffffffff80121a27>{kprobe_exceptions_notify+40}
       <ffffffff801432fb>{notifier_call_chain+31}
       <ffffffff80112460>{do_int3+66}
       <ffffffff80110aed>{error_exit+0}
       <ffffffff80121786>{kretprobe_trampoline+1}
       <ffffffff80121785>{kretprobe_trampoline+0}
       <ffffffff80222bad>{tty_ldisc_try+60}
       <ffffffff80303924>{schedule_timeout+101}
       <ffffffff80222ce0>{tty_ldisc_deref+103}
       <ffffffff8018902b>{do_select+939}
       <ffffffff80188bc5>{__pollwait+0}
       <ffffffff801893aa>{sys_select+820}
       <ffffffff801908fc>{dnotify_parent+34}
       <ffffffff80110052>{system_call+126}

Code: 0f 0b 64 b8 31 80 ff ff ff ff c1 01 49 89 ad 80 00 00 00 e8
RIP <ffffffff80121809>{trampoline_probe_handler+130} RSP
<0000010071d81b28>
Reply | Threaded
Open this post in threaded view
|

RE: [Bug translator/1276] support more timer varieties

Keshavamurthy, Anil S
In reply to this post by Frank Ch. Eigler
Josh,
        We fixed similar return probe bug in RHEL4-U3 kernel. So, Can
you please try the same
test on  RHEL4-U3 Beta and get back to the mailing list.

Thanks,
Anil Keshavamurthy

>-----Original Message-----
>From: [hidden email]
>[mailto:[hidden email]] On Behalf Of Stone, Joshua I
>Sent: Wednesday, December 07, 2005 6:40 PM
>To: [hidden email]
>Subject: RE: [Bug translator/1276] support more timer varieties
>
>Stone, Joshua I wrote:
>> When I was writing the test script, I did find a couple of
>probes that
>> failed, but it was unrelated to "current" dereferencing.  I will
>> follow up in another email...
>
>As promised, here they are.  I tried these with very simple
>probe bodies
>(increment a global number), so I know the crashes have nothing to do
>with "current" dereferencing.
>
>The first one that failed on me was 'kernel.inline("get_current")'.  I
>figured this would be a dangerous one, but I wanted to try it.  This
>matched 2710 locations for me, and probably a lot of them are in
>critical locations.  When I ran this, I got "NMI Watchdog detected
>LOCKUP".  It probably isn't worth trying to make this one work, but if
>we have a blacklist, this should be on it.
>
>The second one that failed was 'kernel.function("__switch_to").return'.
>This one is a problem with kretprobes only, as all of my other
>probes in
>__switch_to behaved just fine, even in the middle of the function.
>Running this gave "Kernel BUG at kprobes:449" (the full dump
>is included
>below).  The line mentioned is in trampoline_probe_handler:
>
>    BUG_ON(!orig_ret_address || (orig_ret_address ==
>trampoline_address));
>
>It seems pretty obvious that the actions taken in __switch_to would
>conflict with the way the trampoline works.  I don't know if it's
>possible to make kretprobes work on this function - if not, it
>should be
>blacklisted.
>
>
>Josh
>
>
>crashdump from kernel.function("__switch_to").return
>===================================================================
>Kernel BUG at kprobes:449
>invalid operand: 0000 [1] SMP
>CPU 1
>Modules linked in: stap_4037(U) nfsd exportfs lockd md5 ipv6
>parport_pc lp parport autofs4 i2c_dev i2c_core smbfs sunrpc ds
>yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack
>iptable_filter ip_tables dm_mirror dm_mod button battery ac
>joydev uhci_hcd ehci_hcd shpchp hw_random ata_piix snd_azx
>snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
>soundcore snd_page_alloc sk98lin(U) floppy ext3 jbd ahci libata
>sd_mod scsi_mod
>Pid: 4006, comm: sshd Tainted: GF     2.6.9-22.ELsmp
>RIP: 0010:[<ffffffff80121809>]
><ffffffff80121809>{trampoline_probe_handler+130}
>RSP: 0018:0000010071d81b28  EFLAGS: 00010002
>RAX: 0000000000000001 RBX: 000001007f9d4600 RCX: ffffffffa0000000
>RDX: 0000000000000001 RSI: 0000010071d81bf8 RDI: 0000000000000000
>RBP: 0000000000000000 R08: 0000010037e14000 R09: 0000000000000003
>R10: 00000000000000ff R11: 0000000000000003 R12: 0000000000000000
>R13: 0000010071d81bf8 R14: 0000000000000000 R15: 0000010002c1a5e0
>FS:  0000002a96a33280(0000) GS:ffffffff804d3180(0000)
>knlGS:0000000000000000
>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>CR2: 000000383c38f070 CR3: 0000000037e18000 CR4: 00000000000006e0
>Process sshd (pid: 4006, threadinfo 0000010071d80000, task
>0000010076dc3030)
>Stack: ffffffff803d16a0 ffffffff803d16a0 ffffffff80121785
>0000010071d81bf8
>       ffffffff80121786 ffffffff80121751 ffffffff803db140
>0000010071d81bb8
>       0000000000000002 0000010076dc3030
>Call Trace:
>       <ffffffff80121785>{kretprobe_trampoline+0}
>       <ffffffff80121786>{kretprobe_trampoline+1}
>       <ffffffff80121751>{kprobe_handler+559}
>       <ffffffff80121a27>{kprobe_exceptions_notify+40}
>       <ffffffff801432fb>{notifier_call_chain+31}
>       <ffffffff80112460>{do_int3+66}
>       <ffffffff80110aed>{error_exit+0}
>       <ffffffff80121786>{kretprobe_trampoline+1}
>       <ffffffff80121785>{kretprobe_trampoline+0}
>       <ffffffff80222bad>{tty_ldisc_try+60}
>       <ffffffff80303924>{schedule_timeout+101}
>       <ffffffff80222ce0>{tty_ldisc_deref+103}
>       <ffffffff8018902b>{do_select+939}
>       <ffffffff80188bc5>{__pollwait+0}
>       <ffffffff801893aa>{sys_select+820}
>       <ffffffff801908fc>{dnotify_parent+34}
>       <ffffffff80110052>{system_call+126}
>
>Code: 0f 0b 64 b8 31 80 ff ff ff ff c1 01 49 89 ad 80 00 00 00 e8
>RIP <ffffffff80121809>{trampoline_probe_handler+130} RSP
><0000010071d81b28>
>
Reply | Threaded
Open this post in threaded view
|

RE: [Bug translator/1276] support more timer varieties

Stone, Joshua I
In reply to this post by Frank Ch. Eigler
Keshavamurthy, Anil S wrote:
> We fixed similar return probe bug in RHEL4-U3 kernel. So, Can
you
> please try the same test on  RHEL4-U3 Beta and get back to the
> mailing list.

I updated only the kernel-* rpms - that's all that matters here, right?
I'm now on 2.6.9-24.ELsmp...

I get the same result, only with the line number changed: "Kernel BUG at
kprobes:440".  This line 440 is still the same BUG_ON check as I posted
before.

Here is the script I use to trigger it:

  global i
  probe begin { log("starting probe") }
  probe kernel.function("__switch_to").return { ++i }
  probe end { printf("ending probe (%d)\n", i) }

I also trimmed it down to a very small C file that still exhibits the
behavior, included below.

Let me know if there's more info I can provide...

Thanks,

Josh


===================================================================
#include <linux/module.h>
#include <linux/kprobes.h>

#ifdef ARCH_SUPPORTS_KRETPROBES
static int
probe_enter (struct kretprobe_instance *probe_instance, struct pt_regs
*regs) {
  return 0;
}

//
kernel.function("__switch_to@arch/x86_64/kernel/process.c:433").return
static struct kretprobe probe = {
  .kp.addr= (void *) 0xffffffff8010ebea,
  .handler = &probe_enter,
  .maxactive = 0,
};
#endif /* ARCH_SUPPORTS_KRETPROBES */

int init_module(void) {
  #ifdef ARCH_SUPPORTS_KRETPROBES
  return register_kretprobe (&probe);
  #else
  return -1;
  #endif
}

void cleanup_module(void) {
  #ifdef ARCH_SUPPORTS_KRETPROBES
  unregister_kretprobe (&probe);
  #endif
}

MODULE_DESCRIPTION("systemtap probe");
MODULE_LICENSE("GPL");
Reply | Threaded
Open this post in threaded view
|

RE: [Bug translator/1276] support more timer varieties

Keshavamurthy, Anil S
In reply to this post by Frank Ch. Eigler
 
>Keshavamurthy, Anil S wrote:
>> We fixed similar return probe bug in RHEL4-U3 kernel.
>So, Can you
>> please try the same test on  RHEL4-U3 Beta and get back to the
>> mailing list.
>
>I updated only the kernel-* rpms - that's all that matters
>here, right?  

Yup, that is correct.

>Let me know if there's more info I can provide...

I think I have all the info need to reproduce this and I will debug and
will get back to you.

Thanks,
Anil
Reply | Threaded
Open this post in threaded view
|

kretprobes vs __switch_to

Roland McGrath
In reply to this post by Stone, Joshua I
> The second one that failed was 'kernel.function("__switch_to").return'.
> This one is a problem with kretprobes only, as all of my other probes in
> __switch_to behaved just fine, even in the middle of the function.
> Running this gave "Kernel BUG at kprobes:449" (the full dump is included
> below).  The line mentioned is in trampoline_probe_handler:
>
>     BUG_ON(!orig_ret_address || (orig_ret_address ==
> trampoline_address));
>
> It seems pretty obvious that the actions taken in __switch_to would
> conflict with the way the trampoline works.  I don't know if it's
> possible to make kretprobes work on this function - if not, it should be
> blacklisted.

It's not clear to me why this fails.  It's actually some assembly code in
the switch_to macro (used in context_switch) that changes stacks, before it
calls __switch_to.  __switch_to is entered on the new stack and it seems to
me that kretprobes ought to be able to handle it normally.  Clearly I'm
missing something.