static instrumentation for kernel

Frank Ch. Eigler
Hi -

Here is one set of ideas about inserting static instrumentation points
into the kernel.  It predates but is related to the discussion this
summer: <http://sources.redhat.com/ml/systemtap/2005-q3/msg00122.html>
It has what is perhaps an interesting combination of features.  It is
simple, architecture-neutral, does not require nonlocal artifacts like
per-probe declarations, and hopefully is not that slow.  There are
certainly some shortcomings and oversights - please be critical.


The code to be inserted into kernel sources would be a plain macro
call such as:

   SYSTEMTAP_PROBE(name)
   SYSTEMTAP_PROBE_N(name,arg1) // arg1 castable to int64_t numeric
   SYSTEMTAP_PROBE_NS(name,arg1,arg2) // arg2 castable to char* string

The name should be unique within the function.  As you see, arguments
can be passed, encoding the type/arity into the macro name.  Possibly
some super clever typeof() conditionals can make that implicit.
What these macros would expand to is the following.  We'd generate a
menu of these for reasonable arities/type combinations and shove them
into a kernel header.

#define SYSTEMTAP_PROBE(name) \
   do { \
       static void (*__systemtap_probe_##name)(); \
       if (unlikely(__systemtap_probe_##name)) \
           (__systemtap_probe_##name) ();  \
      } while (0)
#define SYSTEMTAP_PROBE_NS(name,arg1,arg2) \
   do { \
       static void (*__systemtap_probe_ns_##name)(int64_t, const char*); \
       if (unlikely(__systemtap_probe_ns_##name)) \
           (__systemtap_probe_ns_##name) ((int64_t)(arg1), \
                                          (const char *)(arg2));  \
      } while (0)

As you see, the gist of it is a conditional call through a function
pointer, where the pointer is in a static variable.  Its name is
stylized: it encodes the probe name, and its parameter arity/type
signature.  (It might need some annotations to make sure the compiler
doesn't elide it, so that it has a convenient alignment, etc.)

Normally, the variable is NULL, so a dormant probe costs a NULL check
of a memory word, plus a likely conditional jump over the function
call.  When a probe is activated, systemtap arranges to overwrite the
NULL with the entry address of a probe handler function.  (This would
mean no sharing of a static probe point between systemtap sessions for
now.)  When the kernel trips across an activated probe point, it just
does the obvious: calls into a function in the systemtap module.  The
indirect call would be somewhat slower but much simpler than a
djprobe, and much faster than a kprobe.  It would be great if
someone volunteered to microbenchmark this macro family.
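
For anyone who wants to poke at the dormant/active behaviour outside the
kernel, here is a rough userspace sketch, assuming GCC or Clang.  It is not
the proposed kernel code: unlikely() is redefined locally, and the pointer is
hoisted to file scope (DEMO_PROBE_DECLARE_NS is a made-up helper) so that the
demo can arm it directly.  The proposal keeps the pointer as a function-local
static and finds it through the symbol table.  demo_handler, traced_function,
arm_probe and disarm_probe are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's unlikely() branch hint (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Demo-only deviation: declare the pointer at file scope so ordinary C code
 * can set it.  volatile keeps the compiler from folding it to NULL. */
#define DEMO_PROBE_DECLARE_NS(name) \
    static void (*volatile __systemtap_probe_ns_##name)(int64_t, const char *)

/* Same shape as the proposed macro: NULL check, then indirect call. */
#define DEMO_PROBE_NS(name, arg1, arg2) \
    do { \
        if (unlikely(__systemtap_probe_ns_##name)) \
            (__systemtap_probe_ns_##name)((int64_t)(arg1), \
                                          (const char *)(arg2)); \
    } while (0)

DEMO_PROBE_DECLARE_NS(open);

static int hits;
static int64_t last_num = -1;
static const char *last_str;

/* Hypothetical probe handler: just records what it was given. */
static void demo_handler(int64_t n, const char *s)
{
    hits++;
    last_num = n;
    last_str = s;
}

/* The instrumented code path. */
static void traced_function(int fd, const char *path)
{
    DEMO_PROBE_NS(open, fd, path);
}

/* What session startup/shutdown would do to the pointer. */
static void arm_probe(void)    { __systemtap_probe_ns_open = demo_handler; }
static void disarm_probe(void) { __systemtap_probe_ns_open = NULL; }
```

Calling traced_function() before arm_probe() leaves hits at 0; after
arm_probe() each call reaches demo_handler; after disarm_probe() the probe is
dormant again.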

OK, now the script side.  Systemtap would support a new family of
probe points:

   probe kernel.probe("name") { print ($arg1) }
   probe module("foo").probe("name") { ... same ... }

with the "name" portion being optionally annotated with enough
function / compilation-unit identifying suffixes to make it unique.
Incoming arguments from the macro calls would be mapped to script
variables named $arg1 etc.  The probe handler is otherwise completely
normal, and would operate under the same sorts of constraints that a
kprobe handler does (atomicity, limited runtime, etc.).

Finally, the translator side.  When it encounters these ".probe()"
probe points, systemtap would look through the symbol table for the
referenced kernel/module, searching for those static variables with
the stylized names.  It stashes away the addresses of those variables
for setting/clearing during session startup/shutdown.  Because the
arity/type information is encoded in the stylized names, it can
generate interface functions with exactly matching signatures for the
static function pointer.  It can emit code to safely copy the incoming
parameters to statically typed pseudo-target variables.  Automagically
type-safe.
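
To make the name decoding concrete, here is a small sketch of how a tool might
split such a symbol into its type signature and probe name.  Everything in it
is an assumption: decode_probe_symbol and struct probe_sig are made-up names,
and the "leading run of n/s followed by an underscore" rule is just one
plausible reading of the naming convention above.  It also exposes an
ambiguity: an argument-less probe called n_foo would produce the same symbol
as SYSTEMTAP_PROBE_N(foo), so a real scheme probably wants an explicit arity
marker.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PROBE_PREFIX "__systemtap_probe_"
#define MAX_ARITY 8   /* arbitrary limit for this sketch */

struct probe_sig {
    char types[MAX_ARITY + 1]; /* "ns" = (int64_t, const char *); "" = no args */
    const char *name;          /* points into the caller's symbol string */
};

/* Returns 0 on success, -1 if sym does not carry the probe prefix. */
static int decode_probe_symbol(const char *sym, struct probe_sig *out)
{
    size_t plen = strlen(PROBE_PREFIX);
    const char *rest;
    size_t k;

    if (strncmp(sym, PROBE_PREFIX, plen) != 0)
        return -1;
    rest = sym + plen;

    /* Heuristic: a leading run of 'n'/'s' followed by '_' and at least one
     * more character is taken as the type signature. */
    k = strspn(rest, "ns");
    if (k > 0 && k <= MAX_ARITY && rest[k] == '_' && rest[k + 1] != '\0') {
        memcpy(out->types, rest, k);
        out->types[k] = '\0';
        out->name = rest + k + 1;
    } else {
        /* No signature found: treat it as an argument-less probe. */
        out->types[0] = '\0';
        out->name = rest;
    }
    return 0;
}
```

With this rule, "__systemtap_probe_ns_sys_open" decodes to types "ns" and
name "sys_open", while "__systemtap_probe_schedule" decodes to an
argument-less probe named "schedule".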


That's it.  Please let me know if the above is unclear or faulty.


- FChE

Re: static instrumentation for kernel

Mathieu Desnoyers-2
Hi,

Just so you remember me, I was on the LTT team at the OLS presentation when we
first discussed these "markers". I am presently working on the next generation
of LTT (the LTTng tracer and LTTV viewer). As I speak, there is already a
feature-incomplete version of the tool ready (see http://ltt.polymtl.ca).


* Frank Ch. Eigler ([hidden email]) wrote:
> The code to be inserted into kernel sources would be a plain macro
> call such as:
>
>    SYSTEMTAP_PROBE(name)
>    SYSTEMTAP_PROBE_N(name,arg1) // arg1 castable to int64_t numeric
>    SYSTEMTAP_PROBE_NS(name,arg1,arg2) // arg2 castable to char* string
>

It seems to somehow limit the variations of parameters that can be passed as
argument : one could need a particular macro for a (int64_t,void*,int32_t,byte).
We can imagine thousands of variations. Or maybe the goal is just to allow a
small subset of types ?

> The name should be unique within the function.  As you see, arguments
> can be passed, encoding the type/arity into the macro name.  Possibly
> some super clever typeof() conditionals can make that implicit.

The problem with the preprocessor is that it's not very "variable argument
list" friendly. It can only pass the arguments along as __VA_ARGS__ to another
macro or function call; it cannot inspect or manipulate them individually.

A solution for using typeof() here would be to do something that looks like
system calls :

SYSTEMTAP_PROBE(name)
SYSTEMTAP_PROBE1(name,arg1)
SYSTEMTAP_PROBE2(name,arg1,arg2)
SYSTEMTAP_PROBE3(name,arg1,arg2,arg3)  and so on..

But I'm afraid I can't figure out how to make this work with typeof().


> What these macros would expand to is the following.  We'd generate a
> menu of these for reasonable arities/type combinations and shove them
> into a kernel header.
>
> #define SYSTEMTAP_PROBE(name) \
>    do { \
>        static void (*__systemtap_probe_##name)(); \
>        if (unlikely(__systemtap_probe_##name)) \
>            (__systemtap_probe_##name) ();  \
>       } while (0)
> #define SYSTEMTAP_PROBE_NS(name,arg1,arg2) \
>    do { \
>        static void (*__systemtap_probe_ns_##name)(int64_t, const char*); \
>        if (unlikely(__systemtap_probe_ns_##name)) \
>            (__systemtap_probe_ns_##name) ((int64_t)(arg1), \
>                                           (const char *)(arg2));  \
>       } while (0)
>
> As you see, the gist of it is a conditional call through a function
> pointer, where the pointer is in a static variable.  Its name is
> stylized: it encodes the probe name, and its parameter arity/type
> signature.  (It might need some annotations to make sure the compiler
> doesn't elide it, so that it has a convenient alignment, etc.)
>

The problem with the if() { call } scheme comes when you want to deactivate the
probe. Upon activation, writing the function pointer is fine for the threads
that pass through this code: they simply won't make the call until the pointer
becomes non-NULL.

When the probe is removed, a thread could clearly end up calling a bad function
pointer by being stalled between the if and the call. You have to take care of
it by making it point to an always valid empty function, but still deactivating
the call with the condition for performance purposes.
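
A userspace sketch of what I mean, using C11 atomics.  The names (struct
probe_site, probe_nop, probe_fire, probe_attach, probe_detach, record) are
invented for illustration, and this is only one way to arrange it: the pointer
is never NULL, the flag keeps the dormant path cheap, and the pointer is
loaded exactly once before the call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef void (*probe_fn)(int64_t);

/* Always-valid empty handler: what a detached probe points to. */
static void probe_nop(int64_t arg) { (void)arg; }

struct probe_site {
    atomic_bool enabled;      /* cheap check on the fast path */
    _Atomic(probe_fn) fn;     /* never NULL: a handler or probe_nop */
};

#define PROBE_SITE_INIT { false, probe_nop }

static inline void probe_fire(struct probe_site *s, int64_t arg)
{
    if (atomic_load_explicit(&s->enabled, memory_order_relaxed)) {
        /* Load once; whatever we see here is callable, even if the probe
         * was detached right after the check above. */
        probe_fn fn = atomic_load_explicit(&s->fn, memory_order_acquire);
        fn(arg);
    }
}

static void probe_attach(struct probe_site *s, probe_fn handler)
{
    atomic_store_explicit(&s->fn, handler, memory_order_release);
    atomic_store_explicit(&s->enabled, true, memory_order_release);
}

static void probe_detach(struct probe_site *s)
{
    atomic_store_explicit(&s->enabled, false, memory_order_release);
    atomic_store_explicit(&s->fn, probe_nop, memory_order_release);
}

/* Demo handler that records the last value it saw. */
static int64_t seen;
static void record(int64_t v) { seen = v; }
```

This only removes the bad-pointer call.  A thread that already loaded the old
handler address can still be running it, so the module holding the handler
must not go away until such threads have finished.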


> [...] The
> indirect call would be somewhat slower but much simpler than a
> djprobe, and much faster than a kprobe.  It would be great if
> someone volunteered to microbenchmark this macro family.
>

But still slower than an inlined function. Inlining has this interesting point :
it doesn't have to build up the function call by putting the arguments on the
stack and does not do the call itself.

If you want to do static tracing, you should really investigate the fastest
solutions available.

In LTTng and LTTV, we use a different approach. We have an "event" code
generator. It takes an event description as input and produces the C logging
function. It can then be easily included as a header with all the versatility of
the C language. Defining a new event becomes as simple as describing the
associated data structure, running genevent, including the header and calling
an inline function to log the event.

If you are looking for a simple and fast static tracing solution, you might
want to look at LTTng here: http://ltt.polymtl.ca > New features.

I think that both dynamic and static tracers might integrate well together. And
did I say that we have a modular trace viewer (LTTV) ?


Regards,


Mathieu Desnoyers



OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

Re: static instrumentation for kernel

Frank Ch. Eigler
Hi -

> [...]
> >    SYSTEMTAP_PROBE(name)
> >    SYSTEMTAP_PROBE_N(name,arg1) // arg1 castable to int64_t numeric
> >    SYSTEMTAP_PROBE_NS(name,arg1,arg2) // arg2 castable to char* string
> >

> It seems to somehow limit the variations of parameters that can be
> passed as argument : one could need a particular macro for a
> (int64_t,void*,int32_t,byte).  We can imagine thousands of
> variations. Or maybe the goal is just to allow a small subset of
> types ?

Since systemtap uses (approximately) only two data types, the
combinatorial explosion is not so bad.  Integers and pointers are
castable to int64_t, and char* strings can get passed as is.  So we
would have 2 ** (maximum-arity) macros to generate, say 256.


> [...]
> SYSTEMTAP_PROBE3(name,arg1,arg2,arg3)  and so on..
> But I'm afraid I can't figure out how to make this work with typeof().

Yeah, typeof() / __builtin_types_compatible_p() is a compile-time (not
preprocessor-time) constant expression.  If one could count on the
compiler removing all code (including the static function pointer)
from the false branches of such compile-time type tests, then it may
"just work" to emit N 2**N-long macros.

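As a rough illustration of the compile-time type test (GCC/Clang only, and
only for one argument), something like the following seems to work.  The
macro and handler names are invented for this sketch, and it calls plain
functions where the real thing would have one static function pointer per
branch.  Both branches must still compile for every argument type, hence the
casts through uintptr_t/intptr_t, which would truncate 64-bit integers on a
32-bit target.

```c
#include <assert.h>
#include <stdint.h>

/* 0 ? (x) : (x) forces array-to-pointer decay, so string literals are
 * classified as char * too.  The operand of __typeof__ is not evaluated. */
#define STAP_DECAYED_TYPE(x) __typeof__(0 ? (x) : (x))

/* Compile-time constant: 1 if x is a char* / const char* string. */
#define STAP_IS_STRING(x) \
    (__builtin_types_compatible_p(STAP_DECAYED_TYPE(x), char *) || \
     __builtin_types_compatible_p(STAP_DECAYED_TYPE(x), const char *))

static char last_kind;
static int64_t last_num;
static const char *last_str;

/* Hypothetical handlers, one per systemtap data type. */
static void handler_n(int64_t v)     { last_kind = 'n'; last_num = v; }
static void handler_s(const char *s) { last_kind = 's'; last_str = s; }

/* The condition is a constant, so the compiler can drop the dead branch. */
#define STAP_PROBE1(arg) \
    do { \
        if (STAP_IS_STRING(arg)) \
            handler_s((const char *)(uintptr_t)(arg)); \
        else \
            handler_n((int64_t)(intptr_t)(arg)); \
    } while (0)
```

STAP_PROBE1(42) ends up in handler_n, STAP_PROBE1("hello") in handler_s, and
a void* argument is treated as a number, matching the "integers and pointers
are castable to int64_t" rule.
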

> The problem with the if() { call } scheme comes when you want to
> deactivate the probe. [...]  When the probe is removed, a thread
> could clearly end up calling a bad function pointer by being stalled
> between the if and the call. [...]

Thanks for pointing out that such concurrency issues have to be
investigated.  Since session startup/shutdown is not as time critical
as probe execution is, something as conservative as the
djprobes-inspired IPI/quiescence could do the trick here.


> > [...] The indirect call would be somewhat slower but much simpler
> > than a djprobe, and much faster than a kprobe.  [...]

> But still slower than an inlined function. Inlining has this
> interesting point: it doesn't have to build up the function call by
> putting the arguments on the stack and does not do the call itself.

Indeed, but if the code to be inlined is to be changeable (i.e., let
the user write an arbitrary probe handler), then this can't work
without kernel recompilation/rebooting each time.


> If you want to do static tracing, you should really investigate the
> fastest solutions available.

The thing is, I don't just want to do *tracing*.


> [...]  And did I say that we have a modular trace viewer (LTTV) ?

Emitting data in a compatible form could be very useful for systemtap
users.


- FChE

Re: static instrumentation for kernel

Tom Zanussi
In reply to this post by Frank Ch. Eigler
Frank Ch. Eigler writes:
 > Hi -
 >
 > Here is one set of ideas about inserting static instrumentation points
 > into the kernel.  It predates but is related to the discussion this
 > summer: <http://sources.redhat.com/ml/systemtap/2005-q3/msg00122.html>
 > It has what is perhaps an interesting combination of features.  It is
 > simple, architecture-neutral, does not require nonlocal artifacts like
 > per-probe declarations, and hopefully is not that slow.  There are
 > certainly some shortcomings and oversights - please be critical.
 >
 >
 > The code to be inserted into kernel sources would be a plain macro
 > call such as:
 >
 >    SYSTEMTAP_PROBE(name)
 >    SYSTEMTAP_PROBE_N(name,arg1) // arg1 castable to int64_t numeric
 >    SYSTEMTAP_PROBE_NS(name,arg1,arg2) // arg2 castable to char* string
 >
 > The name should be unique within the function.  As you see, arguments
 > can be passed, encoding the type/arity into the macro name.  Possibly
 > some super clever typeof() conditionals can make that implicit.
 > What these macros would expand to is the following.  We'd generate a
 > menu of these for reasonable arities/type combinations and shove them
 > into a kernel header.
 >
 > #define SYSTEMTAP_PROBE(name) \
 >    do { \
 >        static void (*__systemtap_probe_##name)(); \
 >        if (unlikely(__systemtap_probe_##name)) \
 >            (__systemtap_probe_##name) ();  \
 >       } while (0)
 > #define SYSTEMTAP_PROBE_NS(name,arg1,arg2) \
 >    do { \
 >        static void (*__systemtap_probe_ns_##name)(int64_t, const char*); \
 >        if (unlikely(__systemtap_probe_ns_##name)) \
 >            (__systemtap_probe_ns_##name) ((int64_t)(arg1), \
 >                                           (const char *)(arg2));  \
 >       } while (0)
 >

[...]

Just to throw another idea into the mix...

Here's some code I've been playing around with in an effort to provide
some better and more useful 'real-world' relay-app examples (the 'qdt'
in the code stands for 'quick and dirty' tracer).  It's a set of
macros that automatically generates event structs and ids.  It's a
work in progress - the point of it is to make it relatively easy for
developers to add new events when doing 'ad hoc' tracing.  As the name
implies, it's not meant to be used for production systems, as it
requires a rebuild of the kernel to add a new event (but I guess
adding new static tracepoints requires that in any case), but since it
also does some autogeneration for the purposes of static logging, it
might be of some interest.

Basically, to add a new event, you add a simple event description to a
header file and, in the code you want to trace, some boilerplate
code that fills in the struct and logs it.  On the user side, event
descriptions are available as proc files (or the common header can be
included in the user app and recompiled).

That is, to add a new event, you add a line to the EVENTS #define and add
a #define for each event's fields as below; the pattern should be obvious.

/* start event definitions */

#define EVENTS(ACTION, sep)            \
        ACTION(kmalloc_trace) sep \
        ACTION(kfree_trace) sep

#define kmalloc_trace_fields(event, event_name, ACTION) \
    ACTION(event, event_name, alloc_addr, void *)       \
    ACTION(event, event_name, alloc_size, size_t)       \
    ACTION(event, event_name, obj_size, int)

#define kfree_trace_fields(event, event_name, ACTION)   \
    ACTION(event, event_name, free_addr, void *)        \
    ACTION(event, event_name, obj_size, int)

/* end event definitions */

/* struct/id generation macros, inspired by a comp.lang.c++.moderated
   posting by Christopher Eltschka */

#define DECLARE(event, event_name, field, type) type field;

#define DECLARE_EVENT(event_name)       \
struct event_name##_struct \
{                               \
        unsigned char event_id;         \
        struct timeval timestamp;       \
        event_name##_fields(NULL, event_name, DECLARE)  \
} __attribute__((__packed__))

#define REGISTER(event, event_name, field, type) \
        register_qdt_field(event, #field, #type, offsetof(struct event_name##_struct, field), sizeof(((struct event_name##_struct *)0)->field));

#define REGISTER_EVENT(event_name)              \
        {                                       \
                struct qdt_event *event = register_qdt_event(#event_name, event_name, sizeof(struct event_name##_struct)); \
                event_name##_fields(event, event_name, REGISTER);       \
        }

#define EVENT_ID(event_name) event_name

#define ID_EVENTS(events) \
        enum qdt_event_id { \
                events \
        }

#define COMMA ,
#define SEMICOLON ;
       
/* auto-create the event id enum */
ID_EVENTS(EVENTS(EVENT_ID, COMMA));

/* auto-create the event structs */
EVENTS(DECLARE_EVENT, SEMICOLON);

The 'register' functions and macros make the event descriptions appear
in proc files, for a userspace app to read at runtime.  This approach is
similar to what web100 does - thanks to Baruch Even for the suggestion
to take a look at this mechanism.

static int init(void)
{
        /* create the event description proc files */
        EVENTS(REGISTER_EVENT, SEMICOLON);

        return 0;
}
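
If you want to try the generation trick outside the kernel, here is a
cut-down userspace sketch of the same pattern.  It is simplified from the
code above: the field macros take (ACTION, ev) only, the timestamp is
dropped, and the descs[] table with register_all_events() is a stand-in for
the register_qdt_event()/register_qdt_field() calls, which aren't shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Event list and per-event field lists: the only parts a developer edits. */
#define EVENTS(ACTION, sep) \
    ACTION(kmalloc_trace) sep \
    ACTION(kfree_trace) sep

#define kmalloc_trace_fields(ACTION, ev) \
    ACTION(ev, alloc_addr, void *) \
    ACTION(ev, alloc_size, size_t) \
    ACTION(ev, obj_size, int)

#define kfree_trace_fields(ACTION, ev) \
    ACTION(ev, free_addr, void *) \
    ACTION(ev, obj_size, int)

#define COMMA ,
#define SEMICOLON ;

/* 1. Event id enum: kmalloc_trace = 0, kfree_trace = 1. */
#define EVENT_ID(ev) ev
enum qdt_event_id { EVENTS(EVENT_ID, COMMA) QDT_NUM_EVENTS };

/* 2. One packed struct per event. */
#define DECLARE(ev, field, type) type field;
#define DECLARE_EVENT(ev) \
    struct ev##_struct { unsigned char event_id; ev##_fields(DECLARE, ev) } \
    __attribute__((__packed__))
EVENTS(DECLARE_EVENT, SEMICOLON)

/* 3. Runtime field descriptions (stand-in for the proc-file registration). */
struct field_desc {
    const char *event, *field, *type;
    size_t offset, size;
};
static struct field_desc descs[16];
static size_t ndescs;

#define REGISTER(ev, field, type) \
    descs[ndescs++] = (struct field_desc){ #ev, #field, #type, \
        offsetof(struct ev##_struct, field), \
        sizeof(((struct ev##_struct *)0)->field) };
#define REGISTER_EVENT(ev) ev##_fields(REGISTER, ev)

static void register_all_events(void)
{
    ndescs = 0;
    EVENTS(REGISTER_EVENT, SEMICOLON)
}
```

After register_all_events(), descs[] holds five entries (three for
kmalloc_trace, two for kfree_trace), each with the field name, type string,
offset and size that a userspace reader would need to decode the stream.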

Finally, at the instrumentation point, space for the event is reserved
and the data is written into the reserved space, using the
autogenerated event id and struct definition.  qdt_reserve() reserves
space for the event, and also fills in the common fields such as event
id and timestamp before returning, upon which the logging code
directly fills in the rest of the struct.  The nice thing about
knowing the event size ahead of time is that it makes it easy to write
directly into the spot reserved for the data, and it also removes the
need for any intermediate buffering.

in mm/slab.c/__kmalloc():

        /* log kmalloc event */
        if (qdt_chan) {
                unsigned long qdt_flags;
                struct kmalloc_trace_struct *qdt_event;

                local_irq_save(qdt_flags);
                qdt_event = qdt_reserve(kmalloc_trace);
                /* qdt_event is a pointer to reserved memory, which
                   we treat as a pointer to the event struct */
                if (qdt_event) {
                        qdt_event->alloc_addr = alloc_addr;
                        qdt_event->alloc_size = size;
                        qdt_event->obj_size = obj_size;
                }
                /* nothing left to do, we've reserved and written */
                local_irq_restore(qdt_flags);
        }

For efficiency, a top-level check for the open channel skips over the
block if there's no channel open.  Part of the process of closing the
channel is to first switch the channel buffer with a junk page, so if
the channel goes away between the outer check and the qdt_reserve(),
qdt_reserve() will simply reserve junk in the junk page
with no harm done.  At least that's my current plan, such as it is.
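
Roughly, the junk-page idea looks like this.  This is a single-threaded
userspace sketch with made-up names (struct channel, demo_reserve,
demo_begin_close), not the actual qdt_reserve(), which isn't shown in this
mail and which runs with interrupts disabled.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_SIZE 4096

struct channel {
    unsigned char *buf;   /* current write target */
    size_t offset;        /* next free byte in buf */
};

static unsigned char real_buf[BUF_SIZE];
static unsigned char junk_page[BUF_SIZE];   /* writes here are discarded */
static struct channel chan = { real_buf, 0 };

/* Reserve len bytes and stamp the event id into the first byte.
 * Returns NULL if the request cannot be satisfied. */
static void *demo_reserve(unsigned char event_id, size_t len)
{
    unsigned char *slot;

    if (len == 0 || len > BUF_SIZE)
        return NULL;
    if (chan.buf == junk_page)
        chan.offset = 0;          /* junk never fills up: keep overwriting */
    if (chan.offset + len > BUF_SIZE)
        return NULL;              /* real buffer is full */

    slot = chan.buf + chan.offset;
    chan.offset += len;
    slot[0] = event_id;
    return slot;
}

/* First step of closing the channel: redirect late writers to junk. */
static void demo_begin_close(void)
{
    chan.buf = junk_page;
    chan.offset = 0;
}
```

A writer that passed the outer check just before demo_begin_close() still
gets a valid pointer back from demo_reserve(); it just happens to point into
junk_page, so the real buffer is left untouched.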

The user side can directly pluck whatever it wants from the event
stream, given access to the event descriptions, either via proc files
or by including the auto-generated event structs/ids in the event
header.

Tom



Re: static instrumentation for kernel

Mathieu Desnoyers-2
In reply to this post by Frank Ch. Eigler
* Frank Ch. Eigler ([hidden email]) wrote:

> Hi -
>
> > [...]
> > >    SYSTEMTAP_PROBE(name)
> > >    SYSTEMTAP_PROBE_N(name,arg1) // arg1 castable to int64_t numeric
> > >    SYSTEMTAP_PROBE_NS(name,arg1,arg2) // arg2 castable to char* string
> > >
>
> > It seems to somehow limit the variations of parameters that can be
> > passed as argument : one could need a particular macro for a
> > (int64_t,void*,int32_t,byte).  We can imagine thousands of
> > variations. Or maybe the goal is just to allow a small subset of
> > types ?
>
> Since systemtap uses (approximately) only two data types, the
> combinatorial explosion is not so bad.  Integers and pointers are
> castable to int64_t, and char* strings can get passed as is.  So we
> would have 2 ** (maximum-arity) macros to generate, say 256.
>

If you don't need the event logging to be efficient with regard to minimal
type sizes, that could be acceptable.

Still, I like your idea of reusing the logging functions per type. It can save
precious memory space, at the cost of a call.

>
>
> > The problem with the if() { call } scheme comes when you want to
> > deactivate the probe. [...]  When the probe is removed, a thread
> > could clearly end up calling a bad function pointer by being stalled
> > between the if and the call. [...]
>
> Thanks for pointing out that such concurrency issues have to be
> investigated.  Since session startup/shutdown is not as time critical
> as probe execution is, something as conservative as the
> djprobes-inspired IPI/quiescence could do the trick here.
>

That has the same limitation concerning preemption. I suggest that you disable
preemption explicitly around the if statement, so that the probe can be
inserted in this context.
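
Something along these lines, sketched for userspace.  The preempt_disable()
and preempt_enable() here are counting stubs standing in for the kernel
primitives of the same name, and PROBE_N_NOPREEMPT, probe_fn and handler are
invented names.  The idea, as I understand the quiescence approach, is that
once the pointer is cleared and every CPU has scheduled, no thread can still
be sitting between the check and the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins: in the kernel these would be the real
 * preempt_disable()/preempt_enable(). */
static int preempt_depth;
#define preempt_disable() (preempt_depth++)
#define preempt_enable()  (preempt_depth--)

/* Probe pointer, NULL while dormant. */
static void (*volatile probe_fn)(int64_t);

/* Check and call happen inside one non-preemptible region. */
#define PROBE_N_NOPREEMPT(arg1) \
    do { \
        preempt_disable(); \
        if (probe_fn) \
            probe_fn((int64_t)(arg1)); \
        preempt_enable(); \
    } while (0)

/* Demo handler: records the preemption depth it ran under. */
static int depth_seen_in_handler = -1;
static void handler(int64_t v)
{
    (void)v;
    depth_seen_in_handler = preempt_depth;
}
```

The price is that even a dormant probe now pays for the disable/enable pair,
so it would be worth measuring.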

>
> > > [...] The indirect call would be somewhat slower but much simpler
> > > than a djprobe, and much faster than a kprobe.  [...]
>
> > But still slower than an inlined function. Inlining has this
> > interesting point: it doesn't have to build up the function call by
> > putting the arguments on the stack and does not do the call itself.
>
> Indeed, but if the code to be inlined is to be changeable (i.e., let
> the user write an arbitrary probe handler), then this can't work
> without kernel recompilation/rebooting each time.
>

Well, it's partly true. What about an inline function for the fast case (event
logging) and an optional call to a dynamically settable function for special
purposes?
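
In sketch form (userspace, invented names, and deliberately not reentrant, so
this is not how LTTng itself logs): the fixed-format write is inlined at the
call site, and the hook is only called when somebody has installed one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_SLOTS 64

struct log_record {
    uint32_t event_id;
    int64_t value;
};

static struct log_record log_buf[LOG_SLOTS];
static size_t log_next;

/* Optional, dynamically settable hook; NULL when nobody is attached. */
static void (*volatile event_hook)(uint32_t event_id, int64_t value);

static inline void log_event(uint32_t event_id, int64_t value)
{
    /* Fast path: fixed-format logging, inlined into the caller. */
    struct log_record *r = &log_buf[log_next++ % LOG_SLOTS];
    void (*hook)(uint32_t, int64_t);

    r->event_id = event_id;
    r->value = value;

    /* Special-purpose path: taken only when a handler is installed. */
    hook = event_hook;
    if (__builtin_expect(hook != NULL, 0))
        hook(event_id, value);
}

/* Example hook: keeps a running sum of the logged values. */
static int64_t hook_sum;
static void sum_hook(uint32_t id, int64_t v)
{
    (void)id;
    hook_sum += v;
}
```

With event_hook left NULL the only cost beyond the log write is one pointer
test; setting event_hook = sum_hook adds the indirect call.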

>
> > If you want to do static tracing, you should really investigate the
> > fastest solutions available.
>
> The thing is, I don't just want to do *tracing*.
>

The LTTng mechanism does a little more than just "tracing". Its logging
functions could easily be used to replace your SystemTAP stp_printf mechanism,
which seems to be a bit limited. From what I see in your CVS repository, this
logging function really has to be called with interrupts disabled to be
reentrant (string.c:_stp_sprintf()). Therefore, it can't log from a
non-maskable interrupt handler without risk of corruption.

The principal goal of LTTng is to offer a completely reentrant logging
function that can write all types known to the C language, plus a little more.


>
> > [...]  And did I say that we have a modular trace viewer (LTTV) ?
>
> Emitting data in a compatible form could be very useful for systemtap
> users.
>

Yes, I think it would. But it could also be very interesting to see SystemTAP
log through the LTTng mechanism. That would let it log from execution
contexts that are currently not probe-safe.


Mathieu



Re: static instrumentation for kernel

Martin Hunt
In reply to this post by Frank Ch. Eigler
On Tue, 2005-12-13 at 12:08 -0500, Frank Ch. Eigler wrote:

> > [...]
> > >    SYSTEMTAP_PROBE(name)
> > >    SYSTEMTAP_PROBE_N(name,arg1) // arg1 castable to int64_t numeric
> > >    SYSTEMTAP_PROBE_NS(name,arg1,arg2) // arg2 castable to char* string
> > >
>
> > It seems to somehow limit the variations of parameters that can be
> > passed as argument : one could need a particular macro for a
> > (int64_t,void*,int32_t,byte).  We can imagine thousands of
> > variations. Or maybe the goal is just to allow a small subset of
> > types ?
>
> Since systemtap uses (approximately) only two data types, the
> combinatorial explosion is not so bad.  Integers and pointers are
> castable to int64_t, and char* strings can get passed as is.  So we
> would have 2 ** (maximum-arity) macros to generate, say 256.

2**(maximum-arity + 1) - 1
But they could be easily generated; no need to write them by hand.
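
The count comes from summing over every arity from 0 up to the maximum, with
two choices (numeric or string) per argument:
2**0 + 2**1 + ... + 2**N = 2**(N+1) - 1.  A throwaway generator might look
like the sketch below; the function names are made up, and it only prints
the macro names and argument lists, not the macro bodies.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Number of macros for arities 0..max_arity (valid for max_arity < 63). */
static uint64_t macro_count(unsigned max_arity)
{
    return (UINT64_C(1) << (max_arity + 1)) - 1;
}

/* Signature suffix for one combination: bit i clear = numeric argument,
 * bit i set = string argument.  E.g. arity 2, bits 2 -> "NS". */
static void signature(unsigned arity, unsigned bits, char *out)
{
    for (unsigned i = 0; i < arity; i++)
        out[i] = ((bits >> i) & 1u) ? 'S' : 'N';
    out[arity] = '\0';
}

/* Print one macro name per signature; returns how many were printed. */
static uint64_t emit_macro_names(FILE *f, unsigned max_arity)
{
    char sig[17];
    uint64_t emitted = 0;

    if (max_arity > 16)
        max_arity = 16;           /* keep the demo output bounded */
    for (unsigned arity = 0; arity <= max_arity; arity++) {
        for (unsigned bits = 0; bits < (1u << arity); bits++) {
            signature(arity, bits, sig);
            if (arity == 0) {
                fprintf(f, "SYSTEMTAP_PROBE(name)\n");
            } else {
                fprintf(f, "SYSTEMTAP_PROBE_%s(name", sig);
                for (unsigned i = 1; i <= arity; i++)
                    fprintf(f, ",arg%u", i);
                fprintf(f, ")\n");
            }
            emitted++;
        }
    }
    return emitted;
}
```

For a maximum arity of 8, macro_count() gives 511; the 256 figure counts only
the eight-argument forms.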

Martin