improved from 6963 seconds to 640 seconds. This is on top of the
improvement I obtained with my previous patch. The culprit was the
sw-profile-gprof component's use of an attribute interface to obtain the
pc from each sample. It turns out that this alone completely dominates
all other aspects of the simulation.
The patch does two things:
1) Make use of a local reference whenever 'this->stats[current_stats]'
is used more than once in a method of gprof_component. 'this->stats' is
a vector and this change gave me about a 3% improvement.
2) Use a pin interface to provide the pc for each sample to the gprof
component. This was the big win.
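As a rough illustration of item 1, the sketch below contrasts repeated indexing with a single local reference. The types and member names (profile_stats, gprof_like, low, high, count) are hypothetical stand-ins, not the actual contents of gprof.cxx.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one set of profiling statistics.
struct profile_stats {
    unsigned long low = 0;    // lowest pc seen so far
    unsigned long high = 0;   // highest pc seen so far
    unsigned long count = 0;  // number of samples accumulated
};

// Hypothetical component holding a vector of statistics sets.
struct gprof_like {
    std::vector<profile_stats> stats;
    std::size_t current_stats;

    gprof_like() : stats(1), current_stats(0) {}

    // Before: each use re-indexes the vector through 'this'.
    void accumulate_indexed(unsigned long pc) {
        if (this->stats[current_stats].count == 0 ||
            pc < this->stats[current_stats].low)
            this->stats[current_stats].low = pc;
        if (pc > this->stats[current_stats].high)
            this->stats[current_stats].high = pc;
        this->stats[current_stats].count++;
    }

    // After: bind the element once and reuse the local reference.
    void accumulate_local_ref(unsigned long pc) {
        profile_stats& s = this->stats[current_stats];
        if (s.count == 0 || pc < s.low)
            s.low = pc;
        if (pc > s.high)
            s.high = pc;
        s.count++;
    }
};
```

Both methods compute the same result; the second only avoids repeating the index expression. Whether that is measurably faster depends on the compiler and optimization level.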
Rather than have the gprof component obtain and parse the value of an
attribute of the cpu for each sample, the cpu now drives the pc
value on two pins (to handle 64-bit pc's) before driving its
sample-gprof pin.
Since this represents an interface change, I didn't want to commit it
without some review/approval. However, as far as I can tell, the gprof
interface is not used by any existing port.
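The two-pin handoff for a 64-bit pc can be sketched roughly as follows. This is a minimal model under stated assumptions: pin32, profiler and this free-standing sample_gprof function are hypothetical and do not use the real SID pin classes. It only shows the ordering (drive both halves, then the sample pin) and how the receiver recombines them.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// Hypothetical 32-bit pin: remembers the last driven value and
// optionally notifies a listener. Not the SID pin API.
struct pin32 {
    std::uint32_t value = 0;
    std::function<void(std::uint32_t)> on_driven;

    void drive(std::uint32_t v) {
        value = v;
        if (on_driven)
            on_driven(v);
    }
};

// Hypothetical profiler: pc and pc-hi just latch their values;
// driving the sample pin records one hit at the combined pc.
struct profiler {
    pin32 pc_pin, pc_hi_pin, sample_pin;
    std::map<std::uint64_t, unsigned> hits;

    profiler() {
        sample_pin.on_driven = [this](std::uint32_t) {
            std::uint64_t pc =
                (std::uint64_t(pc_hi_pin.value) << 32) | pc_pin.value;
            ++hits[pc];
        };
    }

    // The callback captures 'this', so copying is disabled.
    profiler(const profiler&) = delete;
    profiler& operator=(const profiler&) = delete;
};

// Hypothetical cpu side: drive the low and high halves first,
// then the sample pin that triggers accumulation.
inline void sample_gprof(profiler& p, std::uint64_t pc) {
    p.pc_pin.drive(std::uint32_t(pc & 0xffffffffu));
    p.pc_hi_pin.drive(std::uint32_t(pc >> 32));
    p.sample_pin.drive(1);
}
```

No string formatting or parsing happens on the sample path in this model, which is the point of the change. A 32-bit cpu would simply leave the high pin at its initial value of 0.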
* sidcpuutil.h (gprof_pc_pin,gprof_pc_hi_pin): New members of
basic_cpu.
(sample_gprof): Drive gprof_pc_pin and gprof_pc_hi_pin.
(get_pc,get_pc_hi): New methods of basic_cpu.
(read_watchpoint_memory): Use get_pc.
(basic_cpu): Add gprof-pc and gprof-pc-hi pins.
* gprof.cxx (target_attribute): Removed from gprof_component.
(pc_pin,pc_hi_pin): Added to gprof_component.
(bucket_size_set): Use local reference for this->stats[current_stats].
(accumulate): Likewise. Use pc_pin and pc_hi_pin instead of
target_attribute to get the pc.
(gprof_component): Add pc and pc-hi pins. Don't add value-attribute
attribute. Initialize the driven value of pc_hi_pin with 0.
+ <behavior name="profiling">
+ <p>The component can be configured to provide samples for use by the
+ <component>sw-profile-gprof</component> component in creating
+ a gmon output file. A sample is provided each time the
+ <pin>sample-gprof</pin> is driven. Each time, before the
+ <pin>sample-gprof</pin> is driven, the <pin>gprof-pc</pin> and
+ <pin>gprof-pc-hi</pin> pins will be driven in order to provide the
+ current program counter. <pin>gprof-pc</pin> represents the low order
+ 32 bits of the pc and <pin>gprof-pc-hi</pin> represents the high order
+ 32 bits of the pc (if any).
Nicely done, thanks. (If backward compatibility was at all a concern,
the gprof component could have a new "sample" pin that assumes the
pin-based PC traffic, instead of changing the current interface. But
I agree that the old interface is not worth saving.)
Was the string conversion stuff obvious in profiling output?
>Nicely done, thanks. (If backward compatibility was at all a concern,
>the gprof component could have a new "sample" pin that assumes the
>pin-based PC traffic, instead of changing the current interface. But
>I agree that the old interface is not worth saving.)
>Was the string conversion stuff obvious in profiling output?
I don't know --- I took a more brute force approach, since I knew that
the overhead was in the driving of the cpu's sample-gprof pin (the only
effect of using --gprof). I first suspected overhead in the std::map
used to collect the buckets, but that turned out to be small. I then
tried the "local reference for this->stats[current_stats]" optimization
and got the 3%. I then tried experimentally removing the collection of
the data and found no improvement. That left the parsing of the
attributes which, when experimentally removed, accounted for all of the