MAXACTION exceeded error while using systemtap

Badari Pulavarty
Hi,

I am trying to collect pagecache usage info using systemtap
and I get the following error while reporting. What should I
do to fix it?

Thanks,
Badari


....
mapping = 0xffff8100b1de8a50 nrpages = 3
mapping = 0xffff8100b1dfaa50 nrpages = 2
mapping = 0xffff8100b1d03260 nrpages = 1
mapping = 0xffff8100b1d04e48 nrpages = 3
mapping = 0xffff8100b1e0ee48 nrpages = 1
mapping = 0xffff8100b1e12658 nrpages = 1
mapping = 0xffff8100b1e18a50 nrpages = 2
mapping = 0xffff8100b1e1ce48 nrpages = 1
mapping = 0xffff8100b1e2d658 nrpages = 2
mapping = 0xffff8100b1e34a50 nrpages = 1
mapping = 0xffff8100b1e40a50 nrpages = 3
mapping = 0xffff8100b1e4a658 nrpages = 1
mapping = 0xffff8100b1e9ca50 nrpages = 1
mapping = 0xffff8100b1d6ba50 nrpages = 3
mapping = 0xffff8100b1eb2a50 nrpages = 3
mapping = 0xffff8100b1db3a50 nrpages = 2
mapping = 0xffff8100b1ec2a50 nrpages = 1
mapping = 0xffff8100b1eeb658 nrpages = 1
ERROR: MAXACTION exceeded near embedded-code
at /usr/local/share/systemtap/tapset/logging.stp:9:29




pagecache.stp (660 bytes) Download Attachment

Re: MAXACTION exceeded error while using systemtap

Frank Ch. Eigler

Badari Pulavarty <[hidden email]> writes:

> I am trying to collect pagecache usage info using systemtap
> and I get the following error while reporting. What should I
> do to fix it?
> [...]
> ERROR: MAXACTION exceeded near embedded-code
> at /usr/local/share/systemtap/tapset/logging.stp:9:29

See http://sourceware.org/bugzilla/show_bug.cgi?id=1866

- FChE

Re: MAXACTION exceeded error while using systemtap

Hien Nguyen
In reply to this post by Badari Pulavarty
Badari Pulavarty wrote:

>Hi,
>
>I am trying to collect pagecache usage info using systemtap
>and I get the following error while reporting. What should I
>do to fix it?
>
>Thanks,
>Badari
>
>
>....
>mapping = 0xffff8100b1de8a50 nrpages = 3
>mapping = 0xffff8100b1dfaa50 nrpages = 2
>mapping = 0xffff8100b1d03260 nrpages = 1
>mapping = 0xffff8100b1d04e48 nrpages = 3
>mapping = 0xffff8100b1e0ee48 nrpages = 1
>mapping = 0xffff8100b1e12658 nrpages = 1
>mapping = 0xffff8100b1e18a50 nrpages = 2
>mapping = 0xffff8100b1e1ce48 nrpages = 1
>mapping = 0xffff8100b1e2d658 nrpages = 2
>mapping = 0xffff8100b1e34a50 nrpages = 1
>mapping = 0xffff8100b1e40a50 nrpages = 3
>mapping = 0xffff8100b1e4a658 nrpages = 1
>mapping = 0xffff8100b1e9ca50 nrpages = 1
>mapping = 0xffff8100b1d6ba50 nrpages = 3
>mapping = 0xffff8100b1eb2a50 nrpages = 3
>mapping = 0xffff8100b1db3a50 nrpages = 2
>mapping = 0xffff8100b1ec2a50 nrpages = 1
>mapping = 0xffff8100b1eeb658 nrpages = 1
>ERROR: MAXACTION exceeded near embedded-code
>at /usr/local/share/systemtap/tapset/logging.stp:9:29
>
>
>
>  
>
>------------------------------------------------------------------------
>
>#! stap
>
>global page_cache_pages
>global pageadd, pagedel
>
>function _(n) { return string(n) }
>
>probe kernel.function("add_to_page_cache") {
> page_cache_pages[$mapping] = $mapping->nrpages;
> pageadd++
>}
>
>probe kernel.function("__remove_from_page_cache") {
> page_cache_pages[$page->mapping] = $page->mapping->nrpages;
> pagedel++
>}
>
>function report () {
>  foreach (mapping in page_cache_pages) {
> print("mapping = " . hexstring(mapping) .
> " nrpages = " . _(page_cache_pages[mapping]) . "\n")
>  }
>  print("Totals PageAdd = " . _(pageadd) .
> " PageDel = " . _(pagedel) . "\n")
>  delete page_cache_pages
>}
>
>probe end {
>  report()
>}
>  
>
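Try resetting the action count once in a while with this embedded-C
function:

function reset_maxaction () %{
        /* Zero the current context's action counter so the MAXACTION check starts over. */
        if (CONTEXT && CONTEXT->actioncount)
                CONTEXT->actioncount = 0;
%}

Or start your script with -DMAXACTION=<somehugenumber>.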



RE: MAXACTION exceeded error while using systemtap

Stone, Joshua I
In reply to this post by Badari Pulavarty
Hien Nguyen wrote:
> Try to reset MAXACTION once in a while with this
>
> function reset_maxaction () %{
>         if (CONTEXT && CONTEXT->actioncount)
>                 CONTEXT->actioncount=0;
> %}
>
> OR try to start your script with -DMAXACTION=<somehugenumber>

To me, both of these solutions are really just workarounds for what will
likely be a common problem.  The first requires guru-mode, which should
not be necessary for a simple reporting script.  The second is better,
though cumbersome.


Frank Ch. Eigler wrote:
> See http://sourceware.org/bugzilla/show_bug.cgi?id=1866

This bug was dismissed as being "behavior as designed" - but I think it
is worth questioning the design.  Do we expect this to be a common
problem?  If so, we need to find a way to make it less painful...

At the very least, we should have something along the lines of DTrace's
printa():
http://docs.sun.com/app/docs/doc/817-6223/6mlkidlhv?a=view#chp-fmt-printa

Their printa() only deals with aggregations (stats), but we really need
something like this for general arrays as well.

Bug #1121 seems applicable to this problem - is there any work being
done here?
http://sourceware.org/bugzilla/show_bug.cgi?id=1121


Josh

Re: MAXACTION exceeded error while using systemtap

Frank Ch. Eigler
Hi -

> > See http://sourceware.org/bugzilla/show_bug.cgi?id=1866
>
> This bug was dismissed as being "behavior as designed" - but I think it
> is worth questioning the design.  [...]

Bug #1866 links to #1884, which does that.

> At the very least, we should have something along the Dtrace's printa():
> [...] Bug #1121 seems applicable to this problem [...]

The print/printf routines have worked for a few weeks now.  I don't
know whether Graydon intends to extend them to print arrays also.  The
new print code also makes the elaborate reporting routines used thus
far more compact and, importantly, reduces their statement count.

I am unsure about how to estimate the very real cost of an array-print
operator.  Calling it approximately zero would make it something like
a DoS vector.  Intuitively, it should be proportional to the amount of
output generated, so it relates to bug #1885.

- FChE

RE: MAXACTION exceeded error while using systemtap

Stone, Joshua I
In reply to this post by Badari Pulavarty
Frank Ch. Eigler wrote:
> Bug #1866 links to #1884, which does that.

Oops, sorry, I interpreted the "duplicate of" message backwards...

>> At the very least, we should have something along the Dtrace's
>> printa(): [...] Bug #1121 seems applicable to this problem [...]
>
> The print/printf routines have worked for a few weeks now.  I don't
> know whether Graydon intends to extend them to print arrays also.

The bug mentions printing entire arrays - that is what I was referring
to.

> I am unsure about how to estimate the very real cost of an array-print
> operator.  Calling it approximately zero would make it something like
> a DoS vector.  Intuitively, it should be proportional to the amount of
> output generated, so it relates to bug #1885.

I guess it's a question of the purpose of MAXACTION.  My understanding
is that it is meant to avoid infinite loops/recursion.  In this sense,
giving printa a cost of one is ok, because the size of the array is
bounded by MAXMAPENTRIES.

If you mean for MAXACTION to also provide a bound on time, then I tend
to agree with your comment #4 on 1884 - we need a way to determine a
more representative MAXACTION.


Josh

Re: MAXACTION exceeded error while using systemtap

Frank Ch. Eigler
Hi -

> [...]
> I guess it's a question of the purpose of MAXACTION.  My understanding
> is that it is meant to avoid infinite loops/recursion.  [...]
> If you mean for MAXACTION to also provide a bound on time [...]

I think we need to do the latter also, not just the former.  For
kernel stability and usability purposes, an uninterruptible but finite
loop that lasts several seconds could be as bad as an infinite one.

- FChE

Re: MAXACTION exceeded error while using systemtap

Badari Pulavarty
In reply to this post by Frank Ch. Eigler
On Tue, 2005-12-06 at 11:59 -0500, Frank Ch. Eigler wrote:

> Badari Pulavarty <[hidden email]> writes:
>
> > I am trying to collect pagecache usage info using systemtap
> > and I get the following error while reporting. What should I
> > do to fix it?
> > [...]
> > ERROR: MAXACTION exceeded near embedded-code
> > at /usr/local/share/systemtap/tapset/logging.stp:9:29
>
> See http://sourceware.org/bugzilla/show_bug.cgi?id=1866

Okay, Thank you.

BTW, I am impressed with the NO (noticeable) overhead added
by having systemtap probes. Of course, I wasn't really doing
a performance measurement. I was doing untars and kernel compiles
with probes in add_to_page_cache() and __remove_from_page_cache(),
which get called a lot, and it didn't make much difference in the
kernel compile time :)


time with probe:
(Totals PageAdd = 215326 PageDel = 17612)

real    3m40.492s
user    5m44.894s
sys     1m2.276s

time without probe:

real    3m40.085s
user    5m45.650s
sys     1m1.572s




Thanks,
Badari


Re: MAXACTION exceeded error while using systemtap

William Cohen
Badari Pulavarty wrote:

> On Tue, 2005-12-06 at 11:59 -0500, Frank Ch. Eigler wrote:
>
>>Badari Pulavarty <[hidden email]> writes:
>>
>>
>>>I am trying to collect pagecache usage info using systemtap
>>>and I get the following error while reporting. What should I
>>>do to fix it?
>>>[...]
>>>ERROR: MAXACTION exceeded near embedded-code
>>>at /usr/local/share/systemtap/tapset/logging.stp:9:29
>>
>>See http://sourceware.org/bugzilla/show_bug.cgi?id=1866
>
>
> Okay, Thank you.
>
> BTW, I am impressed with the NO (noticeable) overhead added
> by having systemtap probes. Of course, I wasn't really doing
> a performance measurement. I was doing untars, kernel compiles
> with probes in add_to_page_cache() and remove_from_page() -
> which gets called a lot and didn't make much difference in the
> kernel compile time :)
>
>
> time with probe:
> (Totals PageAdd = 215326 PageDel = 17612)
>
> real    3m40.492s
> user    5m44.894s
> sys     1m2.276s
>
> time without probe:
>
> real    3m40.085s
> user    5m45.650s
> sys     1m1.572s
>
>
>
>
> Thanks,
> Badari
>

Assuming there is one probe fired per PageAdd and PageDel, there would
be 232,938 probes fired during the experiment's 220 seconds. This would
average 1059 probes per second. If one assumes about 1 microsecond
overhead per probe (rough guesstimate), that would be about 0.23 seconds
over the run, which is pretty small.
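
The estimate above can be reproduced with a few lines of arithmetic. The
sketch below uses Python purely as a calculator; the 1 microsecond cost
per probe is the rough assumption stated above, not a measurement, and
the variable names are illustrative.

```python
# Back-of-the-envelope probe overhead, using the totals Badari reported.
page_add = 215_326          # PageAdd count from the run
page_del = 17_612           # PageDel count from the run
run_seconds = 220           # roughly 3m40s of wall-clock time
cost_per_probe_s = 1e-6     # ASSUMPTION: about 1 microsecond per probe hit

probes_fired = page_add + page_del
probes_per_second = probes_fired / run_seconds
total_overhead_s = probes_fired * cost_per_probe_s

print(f"probes fired:       {probes_fired}")
print(f"probes per second:  {probes_per_second:.0f}")
print(f"estimated overhead: {total_overhead_s:.2f} s")
```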

-Will

Re: MAXACTION exceeded error while using systemtap

Badari Pulavarty
On Wed, 2005-12-07 at 10:21 -0500, William Cohen wrote:

> Badari Pulavarty wrote:
> > On Tue, 2005-12-06 at 11:59 -0500, Frank Ch. Eigler wrote:
> >
> >>Badari Pulavarty <[hidden email]> writes:
> >>
> >>
> >>>I am trying to collect pagecache usage info using systemtap
> >>>and I get the following error while reporting. What should I
> >>>do to fix it?
> >>>[...]
> >>>ERROR: MAXACTION exceeded near embedded-code
> >>>at /usr/local/share/systemtap/tapset/logging.stp:9:29
> >>
> >>See http://sourceware.org/bugzilla/show_bug.cgi?id=1866
> >
> >
> > Okay, Thank you.
> >
> > BTW, I am impressed with the NO (noticeable) overhead added
> > by having systemtap probes. Of course, I wasn't really doing
> > a performance measurement. I was doing untars, kernel compiles
> > with probes in add_to_page_cache() and remove_from_page() -
> > which gets called a lot and didn't make much difference in the
> > kernel compile time :)
> >
> >
> > time with probe:
> > (Totals PageAdd = 215326 PageDel = 17612)
> >
> > real    3m40.492s
> > user    5m44.894s
> > sys     1m2.276s
> >
> > time without probe:
> >
> > real    3m40.085s
> > user    5m45.650s
> > sys     1m1.572s
> >
> >
> >
> >
> > Thanks,
> > Badari
> >
>
> Assuming there is one probe fired per PageAdd and PageDel, there would
> be 232,938 probes fired during the experiment's 220 seconds. This would
> average 1059 probes per second. If one assumes about 1 microsecond
> overhead per probe (rough guesstimate), that would be about 0.23 seconds
> over the run, which is pretty small.

The above data is not quite correct. The probe was collecting data for
both the untar and the kernel compile, but time was reported only for
the kernel compile.

So, I collected that again, just for kernel compile:

Totals PageAdd = 151774 PageDel = 17495
        (total = 169269 probes fired)

real    3m39.122s
user    5m44.446s
sys     1m1.976s


Thanks,
Badari


Re: MAXACTION exceeded error while using systemtap

Martin Hunt
In reply to this post by Frank Ch. Eigler
Frank and I have been discussing this in various PRs for a couple weeks.

What I would like to see is MAXACTION set to a very low value in kprobes
(or jprobes or djprobes). It would be equivalent to a couple of
milliseconds, at most.

It would prevent things like sorting and printing maps, no matter what
their size. Theoretically we could allow these for very small maps, but
why would we want to? Rather than attempt to calculate how much time a
potentially time-consuming function will really take, just set it to a
large fixed value assuming the worst.  It would be best to print errors
at compile time in scripts that attempt it.

For timer events, MAXACTION would be very large. Large enough to sort
any size arrays and print maybe 1000 lines or so.

For begin and end probes, MAXACTION would be something less than
infinite so there are no infinite loops.
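
Read as a data structure, this proposal amounts to choosing the limit by
probe context instead of using one global value. A minimal Python sketch
follows; the context names and all of the numbers are hypothetical
placeholders, chosen only to show the shape of the idea.

```python
# Hypothetical per-context action limits; none of these values come from SystemTap.
CONTEXT_LIMITS = {
    "kprobe": 1_000,          # very low: the probe may fire in timing-sensitive code
    "timer": 1_000_000,       # very large: room to sort and print arrays
    "begin": 100_000_000,     # finite, only to stop infinite loops
    "end": 100_000_000,
}


def max_action_for(context):
    """Return the action limit for a probe context, defaulting to the strictest."""
    return CONTEXT_LIMITS.get(context, min(CONTEXT_LIMITS.values()))


def may_run(context, estimated_actions):
    """True if a handler of the given estimated size fits the context's limit."""
    return estimated_actions <= max_action_for(context)
```

Under these made-up numbers, `may_run("timer", 50_000)` is true while
`may_run("kprobe", 50_000)` is false, which is the distinction being
argued for.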




Re: MAXACTION exceeded error while using systemtap

Frank Ch. Eigler
Hi -


hunt wrote:

> What I would like to see is MAXACTION set to a very low value in
> kprobes (or jprobes or djprobes). It would be equivalent to a couple
> of milliseconds, at most.

Could someone undertake an experiment to measure how long some
representative probes actually take?  If it turns out that we can fit
10**4 or even 10**5 "typical" systemtap statements within a small
amount of wall-clock time on a typical machine, then MAXACTION could
simply be bumped up to that number.

The current default MAXACTION value (1000) was picked within the
confines of the same data vacuum that we have within this discussion.
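
To make the limit concrete, here is a toy model in Python of an action
counter like the one under discussion. It is only a sketch: the names
`ActionBudget` and `report`, and the charge of three actions per loop
iteration, are assumptions for illustration and do not reflect how the
SystemTap runtime actually counts statements.

```python
class MaxActionExceeded(Exception):
    """Raised when a handler runs more statements than its budget allows."""


class ActionBudget:
    def __init__(self, limit=1000):   # 1000 is the default quoted above
        self.limit = limit
        self.count = 0

    def charge(self, actions=1):
        self.count += actions
        if self.count > self.limit:
            raise MaxActionExceeded(f"MAXACTION exceeded after {self.count} actions")


def report(page_cache_pages, budget, actions_per_entry=3):
    """Print every map entry, charging the budget for each loop iteration.

    actions_per_entry is a made-up cost (loop step, string building, print).
    Returns the number of lines printed before the budget ran out.
    """
    printed = 0
    try:
        for mapping, nrpages in page_cache_pages.items():
            budget.charge(actions_per_entry)
            print(f"mapping = {mapping:#x} nrpages = {nrpages}")
            printed += 1
    except MaxActionExceeded as err:
        print(f"ERROR: {err}")
    return printed


# 500 fake mappings: more than a 1000-action budget can cover at 3 actions each.
fake_map = {0xffff8100b1000000 + i * 0x1000: 1 + i % 3 for i in range(500)}
lines = report(fake_map, ActionBudget(limit=1000))
```

With these made-up costs the loop stops after 333 of the 500 entries,
which mirrors how the report in the original post was cut off partway
through the map.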


> It would prevent things like sorting and printing maps, no matter
> what their size. Theoretically we could allow these for very small
> maps, but why would we want to?

Well, because they may be necessary for some as-yet-unimagined probing
context.  The problem is not sorting per se, but the amount of time
that sorting (or anything else) takes.  If we were to impose
language-level restrictions, and someone still needed to sort/print
array pieces from within a kprobe, they'd just work around it.  They
can go and open-code a bubble sort in script language, a foreach/print
over the array.  I'm sure you agree this is not desirable.


> Rather than attempt to calculate how much time a potentially
> time-consuming function will really take [...]

I'm not suggesting that such a calculation need be precise enough to
be anything more than a trivial function of the subject map size.  It
just needs to be vaguely proportional to the cost, so we can keep to a
rough deadline.


> For timer events, MAXACTION would be very large. Large enough to
> sort any size arrays and print maybe 1000 lines or so.  For begin
> and end probes, MAXACTION would be something less than infinite so
> there are no infinite loops.

I believe this expresses the mistaken belief that deadlines are only
necessary for kprobes.


- FChE

Re: MAXACTION exceeded error while using systemtap

Martin Hunt
On Wed, 2005-12-07 at 16:09 -0500, Frank Ch. Eigler wrote:

> > It would prevent things like sorting and printing maps, no matter
> > what their size. Theoretically we could allow these for very small
> > maps, but why would we want to?
>
> Well, because they may be necessary for some as-yet-unimagined probing
> context.  The problem is not sorting per se, but the amount of time
> that sorting (or anything else) takes.  If we were to impose
> language-level restrictions, and someone still needed to sort/print
> array pieces from within a kprobe, they'd just work around it.

I think we keep repeating this discussion. You argue for a general-
purpose language which dynamically detects unsafe code and exits.  I
argue that the language is application-specific and certain unsafe
behaviors should not be permitted at all. You say this might disallow
some theoretically useful code. I say it just confuses the programmers
into writing bad code.

For example, imagine a kprobe on an infrequently used function.  When it
is finally hit, it might be nice to sort and print a stored array of
collected data. Doing it in the kprobe risks hitting MAXACTION.  Setting
a flag and having a timer event that checks the flag and dumps the data
would be a better solution.

Another acceptable solution would be to have a way to automatically
defer printing and sorting of arrays to a more acceptable time
(basically like the previous example, but without the need to create a
timer event). This is easy enough to implement internally, but the
current printing syntax would need significant changes.

> They
> can go and open-code a bubble sort in script language, a foreach/print
> over the array.  I'm sure you agree this is not desirable.

Of course not. But if they attempted that, they would still run into the
MAXACTION limit. So it would be safe to do so.

I am in no way advocating eliminating MAXACTION. Just replacing it with
a more flexible function that knows what context we are in (kprobe,
timer, end probe, etc).

> > Rather than attempt to calculate how much time a potentially
> > time-consuming function will really take [...]
>
> I'm not suggesting that such a calculation need be precise enough to
> be anything more than a trivial function of the subject map size.  It
> just needs to be vaguely proportional to the cost, so we can keep to a
> rough deadline.

So we allow people to write code that will work for a while until some
arbitrary point when MAXACTION gets exceeded.  Then (if we still allow
it)  they rerun with MAXACTION set higher and higher until their kernel
crashes.

I think it is possible that on a large multiprocessor system,
determining the size of a pmap could well take longer than is safe
within a kprobe. But I don't have access to hardware to run any tests.
In fact, this is the real problem. Without access to hardware, any value
approximating how long it takes to aggregate a pmap is just a wild
guess. So I am reluctant to go to a lot of work and testing just to come
up with a better guess, which may well be wrong. And all to support
something I think is probably not safe or a good idea.

> > For timer events, MAXACTION would be very large. Large enough to
> > sort any size arrays and print maybe 1000 lines or so.  For begin
> > and end probes, MAXACTION would be something less than infinite so
> > there are no infinite loops.
>
> I believe this expresses the mistaken belief that deadlines are only
> necessary for kprobes.

I believe that contradicts everything I wrote. We are simply arguing
over how fine-grained those deadlines are and how to calculate them, are
we not?

Martin



Re: MAXACTION exceeded error while using systemtap

Frank Ch. Eigler
Hi -


hunt wrote:

> I think we keep repeating this discussion. You argue for a general-
> purpose language which dynamically detects unsafe code and exits.  I
> argue that the language is application-specific and certain unsafe
> behaviors should not be permitted at all.

Or, in other words, for ensuring one type of safety, I believe runtime
means are sufficient with unrestricted language; you believe
probe-point-specific language restrictions are necessary and/or
sufficient (which?).


> You say this might disallow some theoretical useful code.

This is fact.

> I say it just confuses the programmers into writing bad code..

This is speculation.


> For example, imagine a kprobe on an infrequently used function.
> When it is finally hit, it might be nice to sort and print a stored
> array of collected data. Doing it in the kprobe risks hitting
> MAXACTION.  Setting a flag and having a timer event that checks the
> flag and dumps the data would be a better solution.

Really?  What if the kprobe hits more than the expected number of
times per timer poll?
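
The concern can be illustrated with a small simulation. The Python
sketch below models the flag-and-timer scheme as plain methods;
`on_kprobe_hit`, `on_timer_tick` and the event sequence are invented for
the example. It shows that a single boolean flag collapses several hits
between two timer polls into one dump.

```python
class DeferredDump:
    """Toy model: a probe handler sets a flag, a timer handler does the dump."""

    def __init__(self):
        self.dump_requested = False
        self.hits = 0
        self.dumps = 0

    def on_kprobe_hit(self):
        # Cheap work only: record the hit and ask for a dump later.
        self.hits += 1
        self.dump_requested = True

    def on_timer_tick(self):
        # Expensive work happens here, at most once per tick.
        if self.dump_requested:
            self.dumps += 1
            self.dump_requested = False


model = DeferredDump()
# Hypothetical event order: three hits land before the first timer poll.
for event in ["hit", "hit", "hit", "tick", "tick", "hit", "tick"]:
    if event == "hit":
        model.on_kprobe_hit()
    else:
        model.on_timer_tick()
```

Here four hits produce only two dumps, so anything that must happen once
per hit would need a counter or a queue instead of a flag.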

Can you explain why you believe that an operation that takes the exact
same amount of time (dumping the data) is necessarily unsafe in a
kprobe and necessarily safe in a timer probe?


> Another acceptable solution would be to have a way to automatically
> defer printing and sorting of arrays to a more acceptable time [...]
> but the current printing syntax would need significant changes.

This is worth further investigation, but of course has its own
complications.  These include concurrency: this would either require
locking the to-be-sorted/printed arrays until the printing coroutine
runs, or suffer the loss of coherence, or a potentially large
array-snapshot.


> [...]  I am in no way advocating eliminating MAXACTION. Just
> replacing it with a more flexible function that knows what context we
> are in (kprobe, timer, end probe, etc).

OK, but you still need to justify your belief that this more flexible
function can safely have drastically different values in those
different contexts.


> > I'm not suggesting that such a calculation need be precise enough to
> > be anything more than a trivial function of the subject map size.  It
> > just needs to be vaguely proportional to the cost, so we can keep to a
> > rough deadline.
>
> So we allow people to write code that will work for a while until some
> arbitrary point when MAXACTION gets exceeded.  Then (if we still allow
> it)  they rerun with MAXACTION set higher and higher until their kernel
> crashes.

Yes, that's possible.  Setting MAXACTION very high (for some unknown
value of "very") is tantamount to guru-mode loss of protection.


> I think it is possible that on a large multiprocessor system,
> determining the size of a pmap could well take longer than is safe
> within a kprobe. [...]

How is that?  I've already said in the relevant PR and above that we
would not need an accurate size of the hypothetically-aggregated pmap.
A plain sum over the unaggregated per-cpu arrays would be good enough.
That would take just one for loop over the number of CPUs, and might
not even require locks.
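
The cheap estimate described here can be sketched as follows. This is a
Python illustration, not SystemTap runtime code: `per_cpu_maps` is a
hypothetical stand-in for the unaggregated per-CPU arrays, and the
result deliberately overcounts keys that appear on more than one CPU.

```python
def estimated_print_cost(per_cpu_maps):
    """Upper-bound the aggregated map size with one pass over the CPUs.

    Keys present on several CPUs are counted once per CPU, so the estimate
    is never smaller than the true aggregated size.
    """
    return sum(len(cpu_map) for cpu_map in per_cpu_maps)


def aggregated_size(per_cpu_maps):
    """Exact size after merging, shown only for comparison (the costly path)."""
    merged = set()
    for cpu_map in per_cpu_maps:
        merged.update(cpu_map)
    return len(merged)


# Three CPUs that each saw some of the same keys.
per_cpu_maps = [
    {"read": 4, "write": 2},
    {"read": 1, "open": 7},
    {"close": 3},
]
estimate = estimated_print_cost(per_cpu_maps)   # 5
exact = aggregated_size(per_cpu_maps)           # 4
```

The estimate is what would be charged against the action budget before
printing starts, which keeps the charge roughly proportional to the
amount of output without paying for the aggregation itself.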


> > > For timer events, MAXACTION would be very large. Large enough to
> > > sort any size arrays and print maybe 1000 lines or so.  For begin
> > > and end probes, MAXACTION would be something less than infinite so
> > > there are no infinite loops.
> >
> > I believe this expresses the mistaken belief that deadlines are only
> > necessary for kprobes.
>
> I believe that contradicts everything I wrote.

Sorry, I could not interpret "something less than infinite" as an
endorsement of deadlines.

> We are simply arguing over how fine-grained those deadlines are and
> how to calculate them, are we not?

No.  You suggested subsetting the language for kprobe contexts, which
goes well beyond this.


- FChE

Re: MAXACTION exceeded error while using systemtap

Martin Hunt
On Wed, 2005-12-07 at 18:06 -0500, Frank Ch. Eigler wrote:
> [...]
> Or, in other words, for ensuring one type of safety, I believe runtime
> means are sufficient with unrestricted language; you believe
> probe-point-specific language restrictions are necessary and/or
> sufficient (which?).

Necessary, but not sufficient.

I understand your desire for internal elegance.  However, from a
systemtap user's perspective, what you propose is terrible.  You are
proposing having functions that, depending on where they are used, may
work some of the time, until an internal threshold (which depends on
surrounding functions, the number of cpus, and number of elements in an
array) is hit. And then your program terminates with an error!

I believe having something like MAXACTION is necessary as a check
against putting too much in a kprobe or infinite looping. However it
should trigger immediately and should not be based on dynamically
changing thresholds. And should not be user-visible at all.

Systemtap is not a general-purpose programming environment. kprobes are
very timing sensitive. They should do data collection and printing of
simple scalar data. Data analysis can/should be done in other contexts.
I see nothing wrong with documenting this and enforcing it.

> Can you explain why you believe that an operation that takes the exact
> same amount of time (dumping the data) is necessarily unsafe in a
> kprobe and necessarily safe in a timer probe?

Because timer probes can run in process context, which means they can
sleep, be scheduled, and take as long as they want. A kprobe, by
contrast, might be in the middle of a task switch.

> > Another acceptable solution would be to have a way to automatically
> > defer printing and sorting of arrays to a more acceptable time [...]
> > but the current printing syntax would need significant changes.
>
> This is worth further investigation, but of course has its own
> complications.  These include concurrency: this would either require
> locking the to-be-sorted/printed arrays until the printing coroutine
> runs, or suffer the loss of coherence, or a potentially large
> array-snapshot.

The main complication I see is that
printf("my array is\n")
print(@hist_log(foo))
doesn't do what you expect, unless we defer all output...

>
> > [...]  I am in no way advocating eliminating MAXACTION. Just
> > replacing it with a more flexible function that knows what context we
> > are in (kprobe, timer, end probe, etc).
>
> OK, but you still need to justify your belief that this more flexible
> function can safely have drastically different values in those
> different contexts.

Seriously? OK. I checked it.

There is no logical reason why begin or end probes need have any time
limit on them. Other than to prevent infinite loops, because when the
module is loaded, it ties up resources.

Currently I see timer events are implemented as kernel timers. These are
softirqs and would have some time limits and cannot sleep of course.

I took the C code from a simple script and ripped out the kernel timers
and replaced them with work queues.  Then I slept for 10 seconds in the
timer events. And they worked fine. And I summed all the numbers from 1
to 10 billion (which took about 25 seconds) in every timer event and it
worked fine.  And I did the same in probe end. And it worked fine.  And
I tried it all at once. And it worked too.

> [...]
> Sorry, I could not interpret "something less than infinite" as an
> endorsement of deadlines.

What's the MAXACTION equivalent of "2 minutes"?  I'm guessing it's a very
large number.

Martin



Re: MAXACTION exceeded error while using systemtap

James Dickens
On 12/8/05, Martin Hunt <[hidden email]> wrote:

> On Wed, 2005-12-07 at 18:06 -0500, Frank Ch. Eigler wrote:
> > [...]
> > Or, in other words, for ensuring one type of safety, I believe runtime
> > means are sufficient with unrestricted language; you believe
> > probe-point-specific language restrictions are necessary and/or
> > sufficient (which?).
>
> Necessary, but not sufficient.
>
> I understand your desire for internal elegance.  However from a
> systemtap users perspective, what you propose is terrible.  You are
> proposing having functions that depending on where they are used may
> work some of the time, until an internal threshold (which depends on
> surrounding functions, the number of cpus, and number of elements in an
> array) is hit. And then your program terminates with an error!
>
> I believe having something like MAXACTION is necessary as a check
> against putting too much in a kprobe or infinite looping. However it
> should trigger immediately and should not be based on dynamically
> changing thresholds. And should not be user-visible at all.
>

Perhaps you two should look at how your neighbor DTrace deals with
this issue. It seems to work pretty well. They recently talked about
this on their dtrace-discuss mailing list:

http://www.opensolaris.org/jive/thread.jspa?messageID=15073&#15073

BTW, MAXACTION really can't work reliably. What happens when the box is
under load, say with four gigabit NICs all being flooded with data, and
you are probing a function in the fast path?

Not everyone has a box that can sum 4 billion numbers in less than a minute.

and many other reasons.

James Dickens
uadmin.blogspot.com

P.S. I would recommend that the members of this list subscribe to the
dtrace-discuss list on www.opensolaris.org. It's a low-traffic list,
but they do discuss issues you will face as you proceed. I can assure
you that the DTrace programmers monitor your mailing list.


> Systemtap is not a general-purpose programming environment. kprobes are
> very timing sensitive. They should do data collection and printing of
> simple scalar data. Data analysis can/should be done in other contexts.
> I see nothing wrong with documenting this and enforcing it.
>
> > Can you explain why you believe that an operation that takes the exact
> > same amount of time (dumping the data) is necessarily unsafe in a
> > kprobe and necessarily safe in a timer probe?
>
> Because timer probes can run in process context which means they sleep,
> can be scheduled, take as long as they want? Whereas a kprobe might be
> in the middle of a task switch.
>
> > > Another acceptable solution would be to have a way to automatically
> > > defer printing and sorting of arrays to a more acceptable time [...]
> > > but the current printing syntax would need significant changes.
> >
> > This is worth further investigation, but of course has its own
> > complications.  These include concurrency: this would either require
> > locking the to-be-sorted/printed arrays until the printing coroutine
> > runs, or suffer the loss of coherence, or a potentially large
> > array-snapshot.
>
> The main complication I see is that
> printf("my array is\n")
> print(@hist_log(foo))
> doesn't do what you expect, unless we defer all output...
>
> >
> > > [...]  I am in no way advocating eliminating MAXACTION. Just
> > > replacing it with a more flexible function that knows what context we
> > > are in (kprobe, timer, end probe, etc).
> >
> > OK, but you still need to justify your belief that this more flexible
> > function can safely have drastically different values in those
> > different contexts.
>
> Seriously? OK. I checked it.
>
> There is no logical reason why begin or end probes need have any time
> limit on them. Other than to prevent infinite loops, because when the
> module is loaded, it ties up resources.
>
> Currently I see timer events are implemented as kernel timers. These are
> softirqs and would have some time limits and cannot sleep of course.
>
> I took the C code from a simple script and ripped out the kernel timers
> and replaced them with work queues.  Then I slept for 10 seconds in the
> timer events. And they worked fine. And I summed all the numbers from 1
> to 10 billion (which took about 25 seconds) in every timer event and it
> worked fine.  And I did the same in probe end. And it worked fine.  And
> I tried it all at once. And it worked too.
>
> > [...]
> > Sorry, I could not interpret "something less than infinite" as an
> > endorsement of deadlines.
>
> What's the MAXACTION equivalent of "2 minutes"?  I'm guessing it's a very
> large number.
>
> Martin
>
>
>

Re: MAXACTION exceeded error while using systemtap

Frank Ch. Eigler
In reply to this post by Martin Hunt

hunt wrote:

> > Or, in other words, for ensuring one type of safety, I believe runtime
> > means are sufficient with unrestricted language; you believe
> > probe-point-specific language restrictions are necessary and/or
> > sufficient (which?).
>
> Necessary, but not sufficient.

But it can't be necessary either, since plain maxaction counting is by
itself sufficient to provide that same type of safety.


> [...] However, from a systemtap user's perspective, what you propose
> is terrible.  You are proposing having functions that depending on
> where they are used may work some of the time [...]  And then your
> program terminates with an error!

What is so terrible about an error, when it is easily explained?  Once
my bemoaned data vacuum is filled, and a more informed maxaction
default is set (perhaps on a per-probe-point family basis), we may not
actually encounter it in realistic scenarios.
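
To make the counting idea concrete, here is a minimal userspace sketch of a
per-handler action budget with a default chosen per probe-point family. It
is not SystemTap's actual runtime code: the context names, the budget
values, and every function name below are assumptions made up for
illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe-point families; names are illustrative only. */
enum probe_ctx { CTX_KPROBE, CTX_TIMER, CTX_BEGIN_END, CTX_COUNT };

/* Assumed per-family budgets. The values are placeholders, not measurements. */
static const unsigned long max_actions[CTX_COUNT] = {
    [CTX_KPROBE]    = 1000,
    [CTX_TIMER]     = 10000,
    [CTX_BEGIN_END] = 100000,
};

struct handler_state {
    unsigned long actions_left; /* remaining budget for this handler run */
    bool aborted;               /* set once the budget has been exceeded */
};

/* Reset the budget at handler entry according to the probe family. */
static void handler_enter(struct handler_state *s, enum probe_ctx ctx)
{
    s->actions_left = max_actions[ctx];
    s->aborted = false;
}

/* Charge one action. Returns false once the budget is exhausted. */
static bool charge_action(struct handler_state *s)
{
    if (s->actions_left == 0) {
        s->aborted = true;
        return false;
    }
    s->actions_left--;
    return true;
}

/*
 * Example handler body: sum n values, charging one action per iteration.
 * Returns the number of iterations completed; *out holds the partial sum.
 */
static size_t sum_with_budget(struct handler_state *s,
                              const long *v, size_t n, long *out)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!charge_action(s))
            break; /* budget exceeded: stop and let the caller report it */
        total += v[i];
    }
    *out = total;
    return i;
}
```

The check is purely dynamic: the loop is stopped after a bounded number of
steps whatever the script does, and the `aborted` flag is what would be
turned into the user-visible error.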


> I believe having something like MAXACTION is necessary as a check
> against putting too much in a kprobe or infinite looping. However it
> should trigger immediately and should not be based on dynamically
> changing thresholds. And should not be user-visible at all.

What does that mean?  Triggering "immediately" is impossible for
computability reasons.  How do you imagine aborted probes should be
presented if they are "not user-visible"?  We shouldn't notify the
user that their code did not run to completion?


> Systemtap is not a general-purpose programming environment. kprobes
> are very timing sensitive. They should do data collection and
> printing of simple scalar data. Data analysis can/should be done in
> other contexts.  I see nothing wrong with documenting this and
> enforcing it.

Please be specific about what "this" is that you wish to permit in kprobes.
Something loop-free and straitjacketed like dtrace?  If not, then the
language restrictions don't matter (since sorting / array printing can
be reduced to scalar operations, as we've already covered).


> > Can you explain why you believe that an operation that takes the exact
> > same amount of time (dumping the data) is necessarily unsafe in a
> > kprobe and necessarily safe in a timer probe?
>
> Because timer probes can run in process context which means they
> sleep, can be scheduled, take as long as they want?  [...]

You're getting hung up on an accident of implementation.  Probe
handlers are not supposed to be able to sleep (block / be
interrupted / be scheduled).  This was written on day one.  The fact that
timers and begin/end probes don't behave this way is a bug.  Reopening
this issue is possible, but needs better justification than wanting to
run ten-second loops in kernel space.


- FChE

Re: MAXACTION exceeded error while using systemtap

Frank Ch. Eigler
In reply to this post by James Dickens
Hi -

jamesd.wi wrote:

> [...]  perhaps you two, should look at how your neighbor DTrace
> deals with this issue. It seems to work pretty well.

Their situation is a little simpler because the language forbids
looping and recursion, so the individual handler run times are already
statically bounded.


> [...]  btw MAXACTION really can't work reliably: what happens when
> the box is under load, say with 4 gigabit nics all being flooded with
> data, and you are probing a function in the fast path?

Yes, this is a good point.  A system-level watchdog would be useful.


> P.S. I would recommend the members of this list subscribe to the
> dtrace-discuss list on www.opensolaris.org; it's a low-traffic list,
> but they do discuss issues you will face as you proceed. I can
> assure you that the dtrace programmers monitor your mailing list.

Good point.  Welcome all, and feel free to delurk.


- FChE

RE: MAXACTION exceeded error while using systemtap

Stone, Joshua I
In reply to this post by Badari Pulavarty
Martin Hunt wrote:
> Because timer probes can run in process context which means they
> sleep, can be scheduled, take as long as they want? Whereas a kprobe
> might be in the middle of a task switch.

I just want to point out that this is not true of the timer.profile
variety, which runs (by necessity) in a true interrupt context.

Josh

Re: MAXACTION exceeded error while using systemtap

Marcelo Tosatti
In reply to this post by Frank Ch. Eigler
On Thu, Dec 08, 2005 at 08:57:46AM -0500, Frank Ch. Eigler wrote:

> Hi -
>
> jamesd.wi wrote:
>
> > [...]  perhaps you two, should look at how your neighbor DTrace
> > deals with this issue. It seems to work pretty well.
>
> Their situation is a little simpler because the language forbids
> looping and recursion, so the individual handler run times are already
> statically bounded.
>
>
> > [...]  btw MAXACTION really can't work reliably: what happens when
> > the box is under load, say with 4 gigabit nics all being flooded with
> > data, and you are probing a function in the fast path?
>
> Yes, this is a good point.  A system-level watchdog would be useful.
>
>
> > P.S. I would recommend the members of this list subscribe to the
> > dtrace-discuss list on www.opensolaris.org; it's a low-traffic list,
> > but they do discuss issues you will face as you proceed. I can
> > assure you that the dtrace programmers monitor your mailing list.
>
> Good point.  Welcome all, and feel free to delurk.

Hi folks,

I think that SystemTap needs some method to periodically (or when memory
pressure arrives) dump parts of the array to userspace via
relayfs.

relayfs already supports such a mechanism with its "sub-buffer" structure,
where the userspace reader is allowed to read full buffers (thus freeing
them afterwards) while the hooks continue to push data to empty or
partially filled buffers.

James mentions that

"btw MAXACTION really can't work reliably: what happens when the box is
under load, say with 4 gigabit nics all being flooded with data, and
you are probing a function in the fast path?

Not everyone has a box that can sum 4 billion numbers in less than a minute."

relayfs works in two modes once its memory pool is full:

- panics
- drops new entries

I imagine that by using relayfs to send data to userspace one should
automatically get those very important features.
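
As a rough illustration of the sub-buffer idea, here is a single-threaded
userspace model of a channel that drops new entries when every sub-buffer
is waiting for the reader. It does not use the relayfs API: the structure
layout, the sizes, and all function names are assumptions for the sketch,
and it ignores the locking and per-CPU buffers a real kernel channel needs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_COUNT 4   /* assumed number of sub-buffers */
#define SUBBUF_SIZE  64  /* assumed size of each sub-buffer in bytes */

struct subbuf {
    char   data[SUBBUF_SIZE];
    size_t used;  /* bytes written so far */
    bool   full;  /* closed by the writer, waiting for the reader */
};

struct channel {
    struct subbuf sb[SUBBUF_COUNT];
    size_t write_idx;      /* sub-buffer the writer is filling */
    size_t read_idx;       /* oldest sub-buffer not yet consumed */
    unsigned long dropped; /* records lost because no sub-buffer was free */
};

static void channel_init(struct channel *c)
{
    memset(c, 0, sizeof *c);
}

/* Writer side: append one record, or drop it if no sub-buffer is free. */
static bool channel_write(struct channel *c, const void *rec, size_t len)
{
    struct subbuf *cur = &c->sb[c->write_idx];

    if (len > SUBBUF_SIZE || cur->full) {
        c->dropped++;
        return false;
    }
    if (cur->used + len > SUBBUF_SIZE) {
        /* Close the current sub-buffer and move on to the next one. */
        cur->full = true;
        c->write_idx = (c->write_idx + 1) % SUBBUF_COUNT;
        cur = &c->sb[c->write_idx];
        if (cur->full) { /* reader has not freed it yet */
            c->dropped++;
            return false;
        }
    }
    memcpy(cur->data + cur->used, rec, len);
    cur->used += len;
    return true;
}

/*
 * Reader side: copy out the oldest full sub-buffer and free it.
 * out must have room for SUBBUF_SIZE bytes. Returns bytes copied,
 * or 0 if no full sub-buffer is available.
 */
static size_t channel_read(struct channel *c, void *out)
{
    struct subbuf *sb = &c->sb[c->read_idx];
    size_t n;

    if (!sb->full)
        return 0;
    n = sb->used;
    memcpy(out, sb->data, n);
    sb->used = 0;
    sb->full = false;
    c->read_idx = (c->read_idx + 1) % SUBBUF_COUNT;
    return n;
}
```

With 16-byte records this model accepts 16 writes, drops further ones until
the reader drains a sub-buffer, and then accepts writes again. The probe
side never waits on the reader.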

Hope to be on the same page!

