DTrace for Linux

DTrace for Linux

Stone, Joshua I
This is getting a lot of buzz on the blogosphere:
http://blogs.sun.com/roller/page/ahl?entry=dtrace_for_linux

Using BrandZ and DTrace, he was able to root-cause a problem with the
Linux "top" utility.

Looking through his scripts, I think that a lot of his analysis could be
done *today* with SystemTap.  There's some user-mode stuff that we can't
do, but most of it seems ok.

It would be nice if we could duplicate his results using SystemTap, to
show our current maturity...

Comments?

Josh

Re: DTrace for Linux

Martin Hunt
On Wed, 2005-12-14 at 10:47 -0800, Stone, Joshua I wrote:
> This is getting a lot of buzz on the blogosphere:
> http://blogs.sun.com/roller/page/ahl?entry=dtrace_for_linux

OK, I gave it a quick try. To do this, I needed the system_call tapset,
which (like the rest of systemtap) is a rapidly changing work in
progress. Unfortunately it doesn't run on my x86_64 dev system, so I had
to switch to an x86.

First, to get a quick look at what top was doing, I ran this script:

>stap -c top sys1.stp
-------
global called

probe kernel.syscall.* {
        if (execname() == "top")
                called[name] += 1
}

probe end {
        foreach (n in called-)
                printf("top called %s\t%d times\n", n, called[n])      
}
--------

When I exited top, it gave me the following output:
top called read 1551 times
top called open 1433 times
top called creat        1433 times
top called close        1430 times
top called stat64       700 times
top called alarm        285 times
top called fcntl64      253 times
top called rt_sigaction 201 times
top called mmap2        33 times
top called fstat64      30 times
top called write        23 times
top called ioctl        20 times
top called munmap       17 times
top called lseek        15 times
top called getdents64   12 times
top called llseek       10 times
top called access       8 times
top called gettimeofday 6 times
top called select       6 times
[...]

That looks fine. And very much like what Sun got. So on to some crude
profiling:

First thing I notice is that the tapset sets "name" to different values
for the entry and exit probes. This makes matching them up a bit of a
headache. So I edited the tapset and removed the ".return" on the values
for "name" in the return probes. If no one has a reason why they should
be there then we should make this change permanent.

Unfortunately, then I got an error from the syscall tapset on
execve.return:
semantic error: no match for probe point
         while: resolving probe point kernel.function
("sys_execve").return

So I commented out the whole sys_execve section in the tapset and tried
again.

----
global called, ttime

probe kernel.syscall.* {
        if (execname() == "top")
                called[name] = gettimeofday_us()
}
probe kernel.syscall.*.return {
        if (execname() == "top")
                ttime[name] +=  gettimeofday_us() - called[name]
}

probe end {
        foreach ([n] in ttime-)
                printf("%s took %d usecs\n", n, ttime[n])
}
----
This time I got the following output:
select took 100037096 usecs
open took 93042 usecs
read took 79974 usecs
creat took 66253 usecs
close took 30882 usecs
stat64 took 28486 usecs
write took 7171 usecs
fcntl64 took 6297 usecs
alarm took 5958 usecs
rt_sigaction took 4017 usecs
getdents64 took 2289 usecs
ioctl took 747 usecs
lseek took 353 usecs
access took 324 usecs
llseek took 253 usecs
fstat64 took 238 usecs
mmap2 took 186 usecs
munmap took 171 usecs
gettimeofday took 124 usecs
time took 117 usecs
socketcall took 76 usecs
mprotect took 30 usecs
connect took 24 usecs
socket took 16 usecs
[...]

OK, now this is much different than what Sun got. There is certainly no
munmap() problem here. Without detail about Sun's emulation environment
and what version of linux they are running, I cannot even guess what the
problem is. Maybe one clue is this odd comment "[...]what I've heard my
colleague, Bryan, refer to as the "top problem": your system is slow, so
you run top. What's the top process? Top!"  

I've never seen "top" as the top process on anything but an extremely
idle system.

Martin






Re: DTrace for Linux

Kevin Stafford
Martin Hunt wrote:

>[...]
>Unfortunately it doesn't run on my x86_64 dev system, so I had
>to switch to an x86.
>[...]
The syscall tapset has been tested on x86_64 with kernel 2.6.9-24.ELsmp.
I found that differences in architecture, kernel version, or
configuration can cause kernel.syscall.* { } to fail during compilation.
The directory structure has been altered in CVS to provide tapsets only
for kernels & architectures that I have successfully instrumented &
tested.

>First thing I notice is that the tapset sets "name" to different values
>for the entry and exit probes. This makes matching them up a bit of a
>headache. So I edited the tapset and removed the ".return" on the values
>for "name" in the return probes. If no one has a reason why they should
>be there then we should make this change permanent.
This is fine so long as probe alias wildcards are limited to expanding
only within the . (dot) characters. (See BZ # 1928)
For example: if kernel.syscall.* expands to all entry point aliases
(kernel.syscall.*) AND all return point aliases
(kernel.syscall.*.return), then a reference to "name" would not
differentiate between entry probes and return probes.
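The two expansion semantics can be sketched with regular expressions (a
plain illustration of the distinction, not SystemTap's actual matcher):

```python
import re

# Two possible expansion semantics for "kernel.syscall.*":
# (a) "*" stays within one dot-separated component
# (b) "*" greedily crosses dots, so it would also match returns
within_component = re.compile(r"^kernel\.syscall\.[^.]+$")
crosses_dots = re.compile(r"^kernel\.syscall\..+$")

entry_alias = "kernel.syscall.read"
return_alias = "kernel.syscall.read.return"

# (a) matches only the entry alias; "name" stays unambiguous.
assert within_component.match(entry_alias)
assert not within_component.match(return_alias)

# (b) matches both, so one wildcard would mix entry and return
# probes, and a shared "name" could not tell them apart.
assert crosses_dots.match(entry_alias)
assert crosses_dots.match(return_alias)
print("dot-limited expansion keeps entry and return probes separate")
```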

>Unfortunately, then I got an error from the syscall tapset on
>execve.return:
>semantic error: no match for probe point
>         while: resolving probe point kernel.function
>("sys_execve").return
>
>So I commented out the whole sys_execve section in the tapset and tried
>again.
This could be caused by the aforementioned differences in kernel
version, arch, or config options. More likely, it is due to
sys_execve.return being on the translator blacklist. Jim submitted a
patch that fixed the execve return probe issue.
(see http://sourceware.org/ml/systemtap/2005-q4/msg00223.html)

--
Kevin Stafford
DES 2 | MS 2M3
Beaverton - OR
Linux Technology Center
IBM Systems & Technology
Phone: 1-503-578-3039
Email: [hidden email]




Re: DTrace for Linux

Andi Kleen
In reply to this post by Martin Hunt
Martin Hunt <[hidden email]> writes:
>
> OK, now this is much different than what Sun got. There is certainly no
> munmap() problem here. Without detail about Sun's emulation environment
> and what version of linux they are running, I cannot even guess what the
> problem is. Maybe one clue is this odd comment "[...]what I've heard my

If I understand BrandZ from the link above correctly it is essentially
running Linux user space on a Solaris kernel.

If munmap is a problem for them and it isn't on native Linux,
it just means munmap is much slower on Solaris than on real Linux.
Now that they know, they can hopefully do something about it.

However, he makes the point that munmap flushes TLBs and that the cost
scales with the number of CPUs. So it's possible that he ran it on a
bigger system than you and it cost more CPU time because of that.
But I don't quite buy this argument for top, because top
is single-threaded and Linux will never do remote TLB flushes
for a single-threaded process.

It should only make a difference for multithreaded programs where
multiple threads sharing the same VM run on different CPUs.
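Andi's single-threaded point can be sanity-checked from user space. A
minimal sketch in Python (using mmap.mmap(-1, ...) for an anonymous
mapping and close() to unmap it; a rough stand-in for the raw syscalls,
so the absolute numbers are only suggestive):

```python
import mmap
import time

# Time repeated anonymous map/unmap cycles in a single-threaded
# process. On Linux, unmapping here needs no cross-CPU TLB
# shootdown, since no other thread shares this address space.
N = 10_000
start = time.perf_counter()
for _ in range(N):
    m = mmap.mmap(-1, 4096)  # anonymous 4 KB mapping
    m.close()                # munmap()
elapsed_us = (time.perf_counter() - start) * 1e6
print(f"{N} map/unmap cycles took {elapsed_us:.0f} usecs")
```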

Perhaps Solaris isn't that clever in avoiding TLB flushes(?) although
it is hard to imagine this.

-Andi

Fwd: DTrace for Linux

James Dickens
Okay, this was blocked, so I'm resending.


---------- Forwarded message ----------
From: James Dickens <[hidden email]>
Date: Dec 15, 2005 7:53 PM
Subject: Re: DTrace for Linux
To: Andi Kleen <[hidden email]>
Cc: Martin Hunt <[hidden email]>, "[hidden email]"
<[hidden email]>




On 16 Dec 2005 01:55:06 +0100, Andi Kleen <[hidden email]> wrote:

>  Martin Hunt <[hidden email]> writes:
> >
> > OK, now this is much different than what Sun got. There is certainly no
> > munmap() problem here. Without detail about Sun's emulation environment
> > and what version of linux they are running, I cannot even guess what the
> > problem is. Maybe one clue is this odd comment "[...]what I've heard my
>
> If I understand BrandZ from the link above correctly it is essentially
> running Linux user space on a Solaris kernel.
>
> When munmap is a problem for them and it isn't on native Linux
> it just means munmap is much slower on Solaris than on real Linux.
> Now that they know they can hopefully do something about this.
>
> However he makes a point that munmap flushes TLBs and scales
> with number of CPUs. So it's possible that he ran it on a bigger
> system than you and it cost more CPU time because of that.
> But I don't quite buy this argument for top because top
> is single threaded and Linux will never do remote TLB flushes
> for a single threaded process.
>
> It should only make a difference for multithreaded programs where
> multiple threads sharing the same VM run on different CPUs.
>
> Perhaps Solaris isn't that clever in avoiding TLB flushes(?) although
> it is hard to imagine this.

 Okay, this is getting to be off topic, but since you asked, and I have
a system running BrandZ:

 Hardware:
 VMware running on a 2.613 GHz system; the vhost has 384 MB allocated to it

 OS: Solaris Express Community Edition build, patched to run BrandZ bits.


 BrandZ is running the CentOS Linux image that was pre-packaged at
http://www.opensolaris.org/os/community/brandz/
 -bash-2.05b# uname -av
 Linux linux1 2.4.21 BrandX fake linux i686 i686 i386 GNU/Linux
 -bash-2.05b#

 -bash-2.05b# top -v
 procps version 2.0.17
 -bash-2.05b#

 # dtrace -n lx-syscall:::entry'/execname == "top"/{ @[probefunc] = count(); }'
 dtrace: description 'lx-syscall:::entry' matched 272 probes
 ^C

   access                                                            8
   fstat64                                                           8
   gettimeofday                                                      8
   gtime                                                             8
   llseek                                                            8
   select                                                            8
   getdents64                                                       32
   lseek                                                            32
   stat64                                                           80
   rt_sigaction                                                    112
   fcntl64                                                         136
   write                                                           152
   alarm                                                           168
   close                                                           256
   read                                                            328
   open                                                            336
 #

 Okay, this looks normal, but there's no munmap; maybe he was using an
older version of BrandZ (the Linux userland environment). But let's keep
going.

 lx-sys.d
 #!/usr/sbin/dtrace -s

 lx-syscall:::entry
 /execname == "top"/
 {
         self->ts = vtimestamp;
 }

 lx-syscall:::return
 /self->ts/
 {
         @[probefunc] = sum(vtimestamp - self->ts);
         self->ts = 0;
 }


 dtrace: script './lx-sys.d' matched 544 probes
 ^C

   gettimeofday                                                  32490
   llseek                                                        47045
   access                                                        94997
   gtime                                                        111993
   fstat64                                                      129067
   lseek                                                        209131
   select                                                       276210
   alarm                                                        916786
   stat64                                                      1093732
   fcntl64                                                     1143852
   getdents64                                                  1625671
   rt_sigaction                                                2227934
   write                                                       2336432
   close                                                       2749497
   read                                                        5171047
   open                                                        5569743
 #

 Okay, normal, but now open is using the most time. This is because for
each process, top opens a file, reads the data, and closes it. If top
kept the files open and just read from the open file handles, it would
be a lighter process, especially as more and more processes are running.
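That design point can be illustrated roughly from user space. A sketch
in Python, with an ordinary temp file standing in for a /proc entry
(real /proc files have their own open/read semantics, so the absolute
numbers are only suggestive):

```python
import os
import tempfile
import time

# Stand-in for a /proc entry: top re-opens files like
# /proc/<pid>/stat on every refresh.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("fake pid stats\n")

N = 5_000

# Pattern 1: open/read/close on every refresh (what top does).
t0 = time.perf_counter()
for _ in range(N):
    with open(path) as f:
        f.read()
reopen_s = time.perf_counter() - t0

# Pattern 2: keep the file open, rewind, and re-read.
t0 = time.perf_counter()
with open(path) as f:
    for _ in range(N):
        f.seek(0)
        f.read()
reread_s = time.perf_counter() - t0

print(f"reopen: {reopen_s:.4f}s  reread: {reread_s:.4f}s")
os.unlink(path)
```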



 lx-sys2.d
 #!/usr/sbin/dtrace -s

lx-syscall:::entry
/execname == "top"/
{
        self->ts = vtimestamp;
}

lx-syscall:::return
/self->ts/
{
        @[probefunc] = sum(vtimestamp - self->ts);

        @["- all syscalls -"] = sum(vtimestamp - self->ts);
        self->ts = 0;
}

sched:::on-cpu
/execname == "top"/
{
        self->on = timestamp;
}

sched:::off-cpu

/self->on/
{
        @["- total -"] = sum(timestamp - self->on);
        self->on = 0;
}


 dtrace: script './lx-sys2.d' matched 550 probes
 ^C

   getegid                                                       16240
   geteuid                                                       20344
   getgid                                                        24503
   getuid                                                        38199
   gettimeofday                                                  49445
   rt_sigprocmask                                                54397
   llseek                                                        57766
   gtime                                                         83750
   newuname                                                     115006
   brk                                                          126012
   set_thread_area                                              137075
   nanosleep                                                    200987
   access                                                       253384
   lseek                                                        280470
   mmap2                                                        284076
   munmap                                                       441192
   select                                                       544337
   fstat64                                                      595619
   old_mmap                                                     595924
   alarm                                                        891482
   ioctl                                                       1060726
   stat64                                                      1602166
   fcntl64                                                     1958110
   rt_sigaction                                                2526971
   getdents64                                                  2752676
   write                                                       3573395
   socketcall                                                  4072963
   close                                                       4954184
   open                                                        7627755
   read                                                        8364549
   - all syscalls -                                           43303703
   - total -                                                  56738308
 #


 Okay, the rest don't apply, since either he had a weird version of top
or hit some other bug that has since been fixed. Maybe a race, because
he has an SMP machine and I don't.

 James Dickens
 uadmin.blogspot.com



> -Andi
>

Re: DTrace for Linux

Andi Kleen
In reply to this post by Andi Kleen
>   gettimeofday                                                  32490
>   llseek                                                        47045
>   access                                                        94997
>   gtime                                                        111993
>   fstat64                                                      129067
>   lseek                                                        209131
>   select                                                       276210
>   alarm                                                        916786
>   stat64                                                      1093732
>   fcntl64                                                     1143852
>   getdents64                                                  1625671
>   rt_sigaction                                                2227934
>   write                                                       2336432
>   close                                                       2749497
>   read                                                        5171047
>   open                                                        5569743

This looks about expected. top uses /proc a lot and /proc is very
open/read/close/getdents64 intensive.  I suppose the writes are the console output.

Why it uses rt_sigaction that often I don't know but normally
this call should be fairly cheap - at least it is on Linux AFAIK.

Your numbers suggest it isn't quite on Solaris (it takes
a lot of time for not that many calls), perhaps
that is one area where the Sun folks can still improve.

-Andi


Re: DTrace for Linux

weikong
In reply to this post by Martin Hunt
On Thursday, 15 December 2005 at 07:32, Martin Hunt wrote:
> OK, now this is much different than what Sun got. There is certainly no
> munmap() problem here. Without detail about Sun's emulation environment
> and what version of linux they are running, I cannot even guess what the
> problem is. Maybe one clue is this odd comment "[...]what I've heard my
> colleague, Bryan, refer to as the "top problem": your system is slow, so
> you run top. What's the top process? Top!"

About that result: some system calls (such as sys_select, sys_open, and
device-related system calls) may yield the CPU for some reason and go
through the scheduler, so by the time the system call returns, the time
calculated the way your function does it may be bigger than the real
time. So we must subtract the time spent scheduled out.

I wrote a script to calculate each process's system call time and user
function time for comparison, but there are some bugs.
I hope someone can correct them.
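The effect Wei Kong describes can be seen with a quick Python sketch:
wall-clock timing of a blocking call (sleep here, standing in for
select()) includes time spent scheduled out, while process CPU time does
not.

```python
import time

# A blocking call spends most of its time scheduled out:
# wall-clock time grows, but CPU time barely does.
wall0 = time.perf_counter()
cpu0 = time.process_time()
time.sleep(0.2)  # blocked, off-CPU, like select() waiting
wall_s = time.perf_counter() - wall0
cpu_s = time.process_time() - cpu0
print(f"wall: {wall_s*1e6:.0f} usecs  cpu: {cpu_s*1e6:.0f} usecs")
```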

Attachment 1: cmpcall.stp
Attachment 2: result

-Wei Kong


Re: DTrace for Linux

Martin Hunt
On Fri, 2005-12-16 at 11:02 +0800, [hidden email] wrote:

> # BUG 1: The "log" function always throws "segment error"; I do not
> know why. So I "#" them, and add the title in the result
> #        by myself.

They work fine for me.
> # BUG 2: I can not enter the switch_to function, so I use "schedule"
> instead, but the result may not be very precise.

I'll look into this. I saw something about it earlier this week.

> # BUG 3: "delete the global variable" does not work. It makes me
> so sad.

Can you explain what you are expecting and what it did?

To get the best results, you should probably filter your probes using
something like "if (target() == pid())" or "if (execname() == "top")".


>                 log( string( pid ) . "\t" . string( pid_count[ pid,
> cmd ] ) . "\t" . string( pid_time[ pid, cmd ] ) . "\t\t" . string
> ( pids[ pid, cmd ] ) . "\t\t" . string( pid_time[ pid, cmd ] - pids
> [ pid, cmd ] ) . "\t\t" . cmd )

It would be best to use printf() instead of log().


> probe timer.jiffies( 1000 )
> {
>         print_cmpcall( )
>         exit( )
> }

If you use a timer to print out a large amount of information, like you
do here, it will cause some inaccuracy in your data, due to the data
structures being locked for a long time while they are printed. If you
are only going to print once, it would be best to simply move the
print_cmpcall() to a probe end.