sort a foreach on a stat value?

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

sort a foreach on a stat value?

Stone, Joshua I
Do we have a syntax to sort a foreach based on a stat value?  Obviously
this doesn't make sense for a histogram, but for the others, like
@count, it seems that this would very often be what a user would want to
do.

For example - here's an unscalable way to do simple thread profiling:

global stat
probe my.event { ++stat[tid()] }
probe end {
  foreach(tid in stat-) // sort by value -> most active
    printf("%d: %d\n", tid, stat[tid])
}

The scalable, per-cpu way to do this is:

global stat
probe my.event { stat[tid()]<<<1 }
probe end {
  foreach(tid in stat-) // sort by value -> ???
    printf("%d: %d\n", tid, @count(stat[tid]))
}

But there doesn't seem to be a way to achieve the same sorting.

Along the same lines, it would be extremely useful to be able to do
"cascading" sort - i.e. sort by more than one field.


Josh
Reply | Threaded
Open this post in threaded view
|

Re: sort a foreach on a stat value?

Frank Ch. Eigler

joshua.i.stone wrote:

> Do we have a syntax to sort a foreach based on a stat value?  [...]

Not at this time.


> [...]
> The scalable, per-cpu way to do this is:
>
> global stat
> probe my.event { stat[tid()]<<<1 }
> probe end {
>   foreach(tid in stat-) // sort by value -> ???
>     printf("%d: %d\n", tid, @count(stat[tid]))
> }

Since the reporting phase does not need to be as "scalable" (fast) as
the accumulation phase, perhaps one could do this thusly today:

 global stat, stat_counts
 probe my.event { stat[tid()]<<<1 }
 probe end {
   foreach (tid in stat) // sort by value -> ???
     stat_counts[tid] = @count(stat[tid])
   foreach (tid in stat_counts-)
     printf("%d: %d\n", tid, stat_counts[tid]) # and/or @avg(stat[tid])) etc.
 }


> Along the same lines, it would be extremely useful to be able to do
> "cascading" sort - i.e. sort by more than one field.

If the runtime provided such an facility, one might imagine the script syntax
exposing it thusly:
  foreach ([x1+, x2--, y2+++] in array----) { ... }
encoding the cascading order in the length of those +/- suffixes.

- FChE
Reply | Threaded
Open this post in threaded view
|

RE: sort a foreach on a stat value?

Stone, Joshua I
In reply to this post by Stone, Joshua I
[hidden email] wrote:
>> Do we have a syntax to sort a foreach based on a stat value?  [...]
>
> Not at this time.

One thing I've noticed is that our foreach syntax has different
semantics than other languages, and this might be hampering us a bit.
In at least perl and bash, the only iterator variable is the value, but
our iterator variables are keys.  This prevents us from doing something
like:

        foreach (cnt- in @count(stat)) {...}

But there is value in knowing the keys as well.  Perhaps what we need is
to allow specifying a value iterator as well, so we can have:

        foreach ([key1, key2, value+] in mymap) {...}

... which also prevents the cost of re-indexing the map.  And perhaps
with stats, it could be something like:

        foreach ([tid, c=@count-, a=@avg++, h=@hist_log] in mystats)
{...}

For more semantic clarity, perhaps a semicolon would divide the key
iterators from the value/stat iterators.  And of course, sorting by a
histogram should not be allowed.

Just some ideas, comments encouraged...


> Since the reporting phase does not need to be as "scalable" (fast) as
> the accumulation phase, perhaps one could do this thusly today:
>
>  global stat, stat_counts
>  probe my.event { stat[tid()]<<<1 }
>  probe end {
>    foreach (tid in stat) // sort by value -> ???
>      stat_counts[tid] = @count(stat[tid])
>    foreach (tid in stat_counts-)
>      printf("%d: %d\n", tid, stat_counts[tid]) # and/or
>  @avg(stat[tid])) etc. }

This is a passable workaround, yes.  The downside is that if stat were
very large, I would have to fudge with the maxaction counter.  If I was
only interested in maybe the top 20, then with a single loop construct
it's easy to break out after 20 and not hit the MAXACTION boundary.
With two loops, I have to iterate the entire map the first time.


>> Along the same lines, it would be extremely useful to be able to do
>> "cascading" sort - i.e. sort by more than one field.
>
> If the runtime provided such an facility, one might imagine the
> script syntax exposing it thusly:
>   foreach ([x1+, x2--, y2+++] in array----) { ... }
> encoding the cascading order in the length of those +/- suffixes.

That's not a bad suggestion, though I think it's not obvious in which
order the cascading happens.  Does + sort before ++?  Or does more
length imply higher sorting precedence?  I think you mean the former...


Josh

Reply | Threaded
Open this post in threaded view
|

Re: sort a foreach on a stat value?

Frank Ch. Eigler
Hi -

joshua.i.stone wrote:

> [...]  One thing I've noticed is that our foreach syntax has
> different semantics than other languages [...]

Indeed, just like in awk, we iterate over indexes rather than values.


> [...]
> foreach ([tid, c=@count-, a=@avg++, h=@hist_log] in mystats)
> [...]

That sort of thing has some promise at abbreviating that excessive
duplication hunt made an example of in bug #2115.  

While this does not address sorting, another related syntactical
possibility is to infer a "[idx1, idx2]" suffix on undecorated
occurrences of the indexed array within the body of a foreach:

   foreach ([x,y] in thingie)
     total += thingie # implied [x,y]

   foreach ([x,y,z] in mystats)
     printf("%d %d %d", @count(mystats), @sum(mystats), @min(mystats))

The latter could be abbreviated further to "@count, @sum, @min", to
infer the innermost-looped array itself, plus its index tuple.

A later independent optimization could make sure that the translator
does not emit duplicate array-lookup operations within loops.


> [...]
> >    foreach (tid in stat) // sort by value -> ???
> >      stat_counts[tid] = @count(stat[tid])
> >    foreach (tid in stat_counts-)
> >      printf("%d: %d\n", tid, stat_counts[tid]) # and/or
> >  @avg(stat[tid])) etc. }
> [...]
> This is a passable workaround, yes.  The downside is that if stat were
> very large, I would have to fudge with the maxaction counter.  If I was
> only interested in maybe the top 20, then with a single loop construct
> it's easy to break out after 20 and not hit the MAXACTION boundary.

Unless I'm mistaken, the current runtime aggregates the whole pmap for
loops/sorting, even if you want just the top 20.  This cost will be
fully reflected in activity count (bug #1885) at some point.  It is
unlikely to cost much less than the explicit copying loop above.

I wonder if this behavior makes sorting on statistical values
sufficiently inefficient that special syntax is not sufficiently
justified at this point, given that open-coding is possible.


> >> Along the same lines, it would be extremely useful to be able to do
> >> "cascading" sort - i.e. sort by more than one field.
> >
> > [...]
> >   foreach ([x1+, x2--, y2+++] in array----) { ... }
>
> That's not a bad suggestion, though I think it's not obvious in which
> order the cascading happens.  [...]

I guess we'd pick and document one of the two interpretations.


- FChE
Reply | Threaded
Open this post in threaded view
|

Language design (was Re: sort a foreach on a stat value?)

Martin Hunt
My I ask that before we start adding more and more features to the
language, that we pause for a moment and attempt to determine if they
are really necessary?  I never did see any justification for "foreach"
at all.  Is is too late to try to make the language small, easy to use,
efficient and ideal for the collection of data from probe points,
without adding in a bunch of general purpose features that will
encourage programmers to do things they probably shouldn't?

If you really want really fancy post-processing, instead of beefing up
our language, a far better solution would be to pipe the through a
different interpreter, like perl.

On Mon, 2006-01-16 at 15:43 -0500, Frank Ch. Eigler wrote:

> Hi -
>
> joshua.i.stone wrote:
>
> > [...]  One thing I've noticed is that our foreach syntax has
> > different semantics than other languages [...]
>
> Indeed, just like in awk, we iterate over indexes rather than values.
>
>
> > [...]
> > foreach ([tid, c=@count-, a=@avg++, h=@hist_log] in mystats)
> > [...]
>
> That sort of thing has some promise at abbreviating that excessive
> duplication hunt made an example of in bug #2115.  
>
> While this does not address sorting, another related syntactical
> possibility is to infer a "[idx1, idx2]" suffix on undecorated
> occurrences of the indexed array within the body of a foreach:
>
>    foreach ([x,y] in thingie)
>      total += thingie # implied [x,y]
>
>    foreach ([x,y,z] in mystats)
>      printf("%d %d %d", @count(mystats), @sum(mystats), @min(mystats))
>
> The latter could be abbreviated further to "@count, @sum, @min", to
> infer the innermost-looped array itself, plus its index tuple.
>
> A later independent optimization could make sure that the translator
> does not emit duplicate array-lookup operations within loops.
>
>
> > [...]
> > >    foreach (tid in stat) // sort by value -> ???
> > >      stat_counts[tid] = @count(stat[tid])
> > >    foreach (tid in stat_counts-)
> > >      printf("%d: %d\n", tid, stat_counts[tid]) # and/or
> > >  @avg(stat[tid])) etc. }
> > [...]
> > This is a passable workaround, yes.  The downside is that if stat were
> > very large, I would have to fudge with the maxaction counter.  If I was
> > only interested in maybe the top 20, then with a single loop construct
> > it's easy to break out after 20 and not hit the MAXACTION boundary.
>
> Unless I'm mistaken, the current runtime aggregates the whole pmap for
> loops/sorting, even if you want just the top 20.  This cost will be
> fully reflected in activity count (bug #1885) at some point.  It is
> unlikely to cost much less than the explicit copying loop above.
>
> I wonder if this behavior makes sorting on statistical values
> sufficiently inefficient that special syntax is not sufficiently
> justified at this point, given that open-coding is possible.
>
>
> > >> Along the same lines, it would be extremely useful to be able to do
> > >> "cascading" sort - i.e. sort by more than one field.
> > >
> > > [...]
> > >   foreach ([x1+, x2--, y2+++] in array----) { ... }
> >
> > That's not a bad suggestion, though I think it's not obvious in which
> > order the cascading happens.  [...]
>
> I guess we'd pick and document one of the two interpretations.
>
>
> - FChE

Reply | Threaded
Open this post in threaded view
|

Re: Language design (was Re: sort a foreach on a stat value?)

Frank Ch. Eigler

hunt wrote:

> [...] Is is too late to try to make the language small, easy to use,
> efficient [...]

Such pleasant-sounding directives don't by themselves advance debate.
Like apple pie and fatherhood, everyone wants those things.  They help
contrast individual proposals, but that's what we've been doing.  If
you have other ideas, that's great, bring them on.  Even better if
they consider broader system issues rather than just language syntax.

> If you really want really fancy post-processing, instead of beefing
> up our language, a far better solution would be to pipe the through
> a different interpreter, like perl.

While that opinion is embodied by dtrace, I suspect you will find not
everyone is quite so enthused about the requirement to write
nontrivial analysis scripts in *two* languages.  You will also find
that there is no obligation to use the foreach/if/for control
constructs to richly process one's data.  For those wishing to do it
like dtrace, what obstacles do you believe we interpose?


- FChE
Reply | Threaded
Open this post in threaded view
|

Re: Language design (was Re: sort a foreach on a stat value?)

Martin Hunt
On Mon, 2006-01-16 at 16:48 -0500, Frank Ch. Eigler wrote:
> hunt wrote:
>
> > [...] Is is too late to try to make the language small, easy to use,
> > efficient [...]
> Like apple pie and fatherhood, everyone wants those things.  

Apparently not. I haven't seen any discussion on the issue, just
discussion on how to implement yet another feature in foreach.
I see too much emphasis on new features and not enough on efficiency and
ease-of-use.

> They help
> contrast individual proposals, but that's what we've been doing.  If
> you have other ideas, that's great, bring them on.  Even better if
> they consider broader system issues rather than just language syntax.

That is exactly what I would like to see. More thought to improving the
whole systemtap experience.

> > If you really want really fancy post-processing, instead of beefing
> > up our language, a far better solution would be to pipe the through
> > a different interpreter, like perl.
>
> While that opinion is embodied by dtrace, I suspect you will find not
> everyone is quite so enthused about the requirement to write
> nontrivial analysis scripts in *two* languages.  

How do you know? Have we done a survey? Do you think the systemtap
language should be:
A. Very small, simple, fast and allow you to do post-processing in your
favorite scripting language, like perl, python, tcl, etc. Available RSN.
B. Large, complex, slow and include everything you could possible need
including a builtin email client. Available 2008.

or maybe we start with A and expand the language only if users tell us
there are clear advantages to doing so.

> You will also find
> that there is no obligation to use the foreach/if/for control
> constructs to richly process one's data.  For those wishing to do it
> like dtrace, what obstacles do you believe we interpose?

Of course in a complex language, features can be ignored. But consider
how many months of development time went into the foreach statement and
how it continues to mutate. Each new feature added pushes the release
date for 1.0 back and adds complexity to the language, making it less
desireable to those who just want to use systemtap without learning
another large complex language.

Martin


Reply | Threaded
Open this post in threaded view
|

Re: Language design (was Re: sort a foreach on a stat value?)

Frank Ch. Eigler
Hi -

> > > [...] Is is too late to try to make the language small, easy to use,
> > > efficient [...]
> > Like apple pie and fatherhood, everyone wants those things.  
>
> Apparently not. [...]  I see too much emphasis on new features and
> not enough on efficiency and ease-of-use.

I do not share your perceptions.  For example, the various foreach
suggestions improve ease-of-use and have no effect on efficiency.

> [...] Do you think the systemtap language should be:
> A. Very small, simple, fast and allow you to do post-processing in your
> favorite scripting language, like perl, python, tcl, etc. Available RSN.
> B. Large, complex, slow and include everything you could possible need
> including a builtin email client. Available 2008.

We all understand that language design is full of trade-offs, and for
that matter so is system design and implementation.  I believe we have
steered a moderate course so far.  Please keep sharing insight
regarding particular issues, but hold such undirected hyperbole.


> or maybe we start with A and expand the language only if users tell
> us there are clear advantages to doing so. [...]

At this point, we are both users and developers.  Some of the
extensions being contemplated originated from reasonable use
scenarios.  Some of them didn't, and if you check the history, those
tend to be the ones I've been resisting.


- FChE
Reply | Threaded
Open this post in threaded view
|

RE: sort a foreach on a stat value?

Stone, Joshua I
In reply to this post by Stone, Joshua I
Frank Ch. Eigler wrote:

>> foreach ([tid, c=@count-, a=@avg++, h=@hist_log] in mystats)
>
> That sort of thing has some promise at abbreviating that excessive
> duplication hunt made an example of in bug #2115.
>
> While this does not address sorting, another related syntactical
> possibility is to infer a "[idx1, idx2]" suffix on undecorated
> occurrences of the indexed array within the body of a foreach:
>
>    foreach ([x,y] in thingie)
>      total += thingie # implied [x,y]
>
>    foreach ([x,y,z] in mystats)
>      printf("%d %d %d", @count(mystats), @sum(mystats), @min(mystats))
>
> The latter could be abbreviated further to "@count, @sum, @min", to
> infer the innermost-looped array itself, plus its index tuple.
>
> A later independent optimization could make sure that the translator
> does not emit duplicate array-lookup operations within loops.

Inferring like this would go a long way toward simplifying the language
(from a user perspective).  And as you mention, it leaves more
opportunity for optimization.

This can be taken even further if you allow special format characters
(as in _stp_stat_print) - Then your second example could be just:

   foreach ([x,y,z] in mystats)
     printf("%C %S %m", mystats)

At this point we're not too far from a printa, but manually doing a
foreach lets you sort it.

> Unless I'm mistaken, the current runtime aggregates the whole pmap for
> loops/sorting, even if you want just the top 20.  This cost will be
> fully reflected in activity count (bug #1885) at some point.  It is
> unlikely to cost much less than the explicit copying loop above.

Yes, it does fully aggregate when you do the foreach.  As for cost, I
suspect that will depend on the keys - if you're indexing by one or more
strings I would expect the savings to be more significant.

> I wonder if this behavior makes sorting on statistical values
> sufficiently inefficient that special syntax is not sufficiently
> justified at this point, given that open-coding is possible.

Perhaps, but if we can find a "nice" way to express the sorting it may
still be worth pursuing...

>>>> Along the same lines, it would be extremely useful to be able to do
>>>> "cascading" sort - i.e. sort by more than one field.
>>>
>>> [...]
>>>   foreach ([x1+, x2--, y2+++] in array----) { ... }
>>
>> That's not a bad suggestion, though I think it's not obvious in which
>> order the cascading happens.  [...]
>
> I guess we'd pick and document one of the two interpretations.

Another possible syntax choice could be:

  foreach([x, y, z] in array sorted [@count-, x+, z-])

Thus the sorting order is explicit.  Ascending (+) could be implied.


Josh