[patch][rfa] --gprof Performance Improvement

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[patch][rfa] --gprof Performance Improvement

Dave Brolley-2
Hi,

I'd like to commit the attached patch which improves the performance of
the --gprof option by an order of magnitude. One benchmark I tried with

  sid --board=<board> --final-insn-count --gprof=<file>,cycles=1

improved from 6963 seconds to 640 seconds. This is on top of the
improvement I obtained with my previous patch. The culprit was the
sw-profile-gprof component's use of an attribute interface to obtain the
pc from each sample. It turns out that this alone completely dominates
all other aspects of the simulation.

The patch does two things:

1) Make use of a local reference whenever 'this->stats[current_stats]'
is used more than once in a method of gprof_component. 'this->stats' is
a vector and this change gave me about a 3% improvement

2) Use a pin interface to provide the pc for each sample to the gprof
component. This was the big win.

Rather than have the gprof component obtain and parse the value of an
attribute of the cpu for each sample, instead the cpu now drives the pc
value on two pins (to handle 64 bit pc's) before driving its
sample-gprof pin.

Since this represents an interface change, I didn't want to commit it
with some review/approval, however, as far as I can tell, the gprof
interface is not used by any existing port.

Comments? OK to commit?

Dave

2006-06-26  Dave Brolley  <[hidden email]>

        * commonCfg.cxx (GprofCfg): Connect the cpu's gprof-pc and gprof-pc-hi
        pins to out pc and p-hi pins respectively.

2006-06-26  Dave Brolley  <[hidden email]>

        * sidcpuutil.h (gprof_pc_pin,gprof_pc_hi_pin): New members of
        basic_cpu.
        (sample_gprof): Drive gprof_pc_pin and gprof_pc_hi_pin.
        (get_pc,get_pc_hi): New methods od basic_cpu.
        (read_watchpoint_memory): Use get_pc.
        (basic_cpu): Add gprof-pc and gprof-pc-hi pins.

2006-06-26  Dave Brolley  <[hidden email]>

        * gprof.cxx (target_attribute): Removed from gprof_component.
        (pc_pin,pc_hi_pin): Added to gprof_component.
        (bucket_size_set): Use local reference for this->stats[current_stats].
        (accumulate_call): Likewise.
        (accumulate): Likewise. Use pc_pin and pc_hi_pin instead of
        target_attribute to get the pc.
        (gprof_component): Add pc and pc-hi pins. Don't add value-attribute
        attribute. Initialize the driven value of pc_hi_pin with 0.

2006-06-26  Dave Brolley  <[hidden email]>

        * xstormy16.h (get_pc): New member of xstormy16_cpu.

2006-06-26  Dave Brolley  <[hidden email]>

        * mt.h (get_pc): New member of mt_cpu.

2006-06-26  Dave Brolley  <[hidden email]>

        * m32rbf.h (get_pc): New member of m32rbf_cpu.

2006-06-26  Dave Brolley  <[hidden email]>

        * arm7f.h (get_pc): New member of arm7f_cpu.
        * hw-cpu-arm7t.txt: Regenerated.

2006-06-26  Dave Brolley  <[hidden email]>

        * common-xml/behavior.xml (profiling): New behavior documented.
        * common-xml/interface.xml (pins): Document gprof-pc, gprof-pc-hi
        and sample-gprof pins.



Index: sid/component/cgen-cpu/arm7t/arm7f.h
===================================================================
RCS file: /cvs/src/src/sid/component/cgen-cpu/arm7t/arm7f.h,v
retrieving revision 1.4
diff -c -p -r1.4 arm7f.h
*** sid/component/cgen-cpu/arm7t/arm7f.h 4 Aug 2004 15:42:00 -0000 1.4
--- sid/component/cgen-cpu/arm7t/arm7f.h 26 Jun 2006 19:26:09 -0000
*************** private:
*** 130,135 ****
--- 130,141 ----
        this->h_pc_set ((PCADDR) v);
      }
 
+   host_int_4
+   get_pc ()
+     {
+       return this->h_pc_get ();
+     }
+
    // debug support routines
    string dbg_get_reg (host_int_4 n);
    component::status dbg_set_reg (host_int_4 n, const string& s);
Index: sid/component/cgen-cpu/common-xml/behavior.xml
===================================================================
RCS file: /cvs/src/src/sid/component/cgen-cpu/common-xml/behavior.xml,v
retrieving revision 1.5
diff -c -p -r1.5 behavior.xml
*** sid/component/cgen-cpu/common-xml/behavior.xml 22 Mar 2004 21:38:38 -0000 1.5
--- sid/component/cgen-cpu/common-xml/behavior.xml 26 Jun 2006 19:26:09 -0000
***************
*** 79,84 ****
--- 79,98 ----
        stream.</p>
      </behavior>
 
+     <behavior name="profiling">
+
+       <p>The component can be configured to provide samples for use by the
+       <component>sw-profile-gprof</component> component in creating
+       a gmon output file. A sample is provided each time the
+       <pin>sample-gprof</pin> is driven. Each time, before the
+       <pin>sample-gprof</pin> is driven, the <pin>gprof-pc</pin> and
+       <pin>gprof-pc-hi</pin> pins will be driven in order to provide the
+       current program counter. <pin>gprof-pc</pin> represents the low order
+       32 bits of the pc and <pin>gprof-pc-hi</pin> represents the high order
+       32 bits of the pc (if any).
+       </p>
+     </behavior>
+
      <behavior name="exceptions/traps">
 
        <p>When encountering exception/trap conditions such as memory
Index: sid/component/cgen-cpu/common-xml/interface.xml
===================================================================
RCS file: /cvs/src/src/sid/component/cgen-cpu/common-xml/interface.xml,v
retrieving revision 1.7
diff -c -p -r1.7 interface.xml
*** sid/component/cgen-cpu/common-xml/interface.xml 22 Mar 2004 21:38:38 -0000 1.7
--- sid/component/cgen-cpu/common-xml/interface.xml 26 Jun 2006 19:26:09 -0000
***************
*** 13,18 ****
--- 13,21 ----
      <defpin name="flush-icache" direction="in" legalvalues="any" behaviors="execution" />
      <defpin name="print-insn-summary!" direction="in" legalvalues="any" behaviors="tracing" />
      <defpin name="trace" direction="in" legalvalues="any" behaviors="tracing" />
+     <defpin name="gprof-pc" direction="out" legalvalues="integer" behaviors="profiling" />
+     <defpin name="gprof-pc-hi" direction="out" legalvalues="integer" behaviors="profiling" />
+     <defpin name="sample-gprof" direction="out" legalvalues="positive integer" behaviors="profiling" />
 
      <!-- accessors -->
      <defaccessor name="data-memory" accesses="any" behaviors="execution" />
Index: sid/component/cgen-cpu/m32r/m32rbf.h
===================================================================
RCS file: /cvs/src/src/sid/component/cgen-cpu/m32r/m32rbf.h,v
retrieving revision 1.2
diff -c -p -r1.2 m32rbf.h
*** sid/component/cgen-cpu/m32r/m32rbf.h 3 Aug 2001 06:02:43 -0000 1.2
--- sid/component/cgen-cpu/m32r/m32rbf.h 26 Jun 2006 19:26:09 -0000
*************** public:
*** 56,61 ****
--- 56,67 ----
        this->hardware.h_pc = (PCADDR) v;
      }
 
+   host_int_4
+   get_pc ()
+     {
+       return this->hardware.h_pc;
+     }
+
    // Called by semantic code to perform branches.
    inline void
    branch (PCADDR new_pc, PCADDR& npc, sem_status& status)
Index: sid/component/cgen-cpu/mt/mt.h
===================================================================
RCS file: /cvs/src/src/sid/component/cgen-cpu/mt/mt.h,v
retrieving revision 1.2
diff -c -p -r1.2 mt.h
*** sid/component/cgen-cpu/mt/mt.h 16 Dec 2005 10:23:13 -0000 1.2
--- sid/component/cgen-cpu/mt/mt.h 26 Jun 2006 19:26:09 -0000
*************** namespace mt
*** 128,133 ****
--- 128,138 ----
   this->hardware.h_pc = (PCADDR) v;
  }
       
+       host_int_4 get_pc ()
+         {
+  return this->hardware.h_pc;
+ }
+
        // Called by semantic code to perform branches.
        inline void
        branch (PCADDR new_pc, PCADDR& npc, sem_status& status)
Index: sid/component/cgen-cpu/xstormy16/xstormy16.h
===================================================================
RCS file: /cvs/src/src/sid/component/cgen-cpu/xstormy16/xstormy16.h,v
retrieving revision 1.2
diff -c -p -r1.2 xstormy16.h
*** sid/component/cgen-cpu/xstormy16/xstormy16.h 11 Jan 2002 07:25:03 -0000 1.2
--- sid/component/cgen-cpu/xstormy16/xstormy16.h 26 Jun 2006 19:26:09 -0000
*************** namespace xstormy16
*** 48,53 ****
--- 48,58 ----
   this->hardware.h_pc = (PCADDR) v;
  }
 
+       host_int_4 get_pc ()
+         {
+  return this->hardware.h_pc;
+ }
+
        // syscall temporary registers
        USI syscall_arg0;
        USI syscall_arg1;
Index: sid/component/profiling/gprof.cxx
===================================================================
RCS file: /cvs/src/src/sid/component/profiling/gprof.cxx,v
retrieving revision 1.16
diff -c -p -r1.16 gprof.cxx
*** sid/component/profiling/gprof.cxx 14 Jun 2006 20:38:56 -0000 1.16
--- sid/component/profiling/gprof.cxx 26 Jun 2006 19:26:10 -0000
*************** namespace profiling_components
*** 135,147 ****
      vector<statistics> stats;
      unsigned current_stats;
 
-     string target_attribute;
      component* target_component;
 
      endian output_file_format;
 
      callback_pin<gprof_component> accumulate_pin;
 
      input_pin cg_caller_hi_pin;
      input_pin cg_caller_pin;
      input_pin cg_callee_hi_pin;
--- 135,148 ----
      vector<statistics> stats;
      unsigned current_stats;
 
      component* target_component;
 
      endian output_file_format;
 
      callback_pin<gprof_component> accumulate_pin;
 
+     input_pin pc_hi_pin;
+     input_pin pc_pin;
      input_pin cg_caller_hi_pin;
      input_pin cg_caller_pin;
      input_pin cg_callee_hi_pin;
*************** namespace profiling_components
*** 214,221 ****
  if (s != component::ok) return s;
 
  // Reject change if we already have samples
! if ((this->stats[current_stats].value_count != 0) &&
!    (this->stats[current_stats].bucket_size != new_bucket_size))
   {
     cerr << "sw-profile-gprof: invalid time to change bucket size" << endl;
     return component::bad_value;
--- 215,223 ----
  if (s != component::ok) return s;
 
  // Reject change if we already have samples
! statistics &st = this->stats[current_stats];
! if ((st.value_count != 0) &&
!    (st.bucket_size != new_bucket_size))
   {
     cerr << "sw-profile-gprof: invalid time to change bucket size" << endl;
     return component::bad_value;
*************** namespace profiling_components
*** 228,234 ****
     return component::bad_value;
   }
 
! this->stats[current_stats].bucket_size = new_bucket_size;
  return component::ok;
        }
 
--- 230,236 ----
     return component::bad_value;
   }
 
! st.bucket_size = new_bucket_size;
  return component::ok;
        }
 
*************** namespace profiling_components
*** 273,303 ****
        {
  if (! this->target_component) return;
 
! string value_str = this->target_component->attribute_value (this->target_attribute);
! host_int_8 value;
! component::status s = parse_attribute (value_str, value);
! if (s != component::ok) return;
!
! value_str = this->target_component->attribute_value (this->target_attribute + "-hi");
! host_int_8 value_hi;
! s = parse_attribute (value_str, value_hi);
! if (s != component::ok)
!  value_hi = 0;
!
! value = (value_hi << 32) | (value & 0xffffffff);
 
  // std::cout << "sampled at 0x" << std::hex << value << std::dec << " for " << stats[current_stats].output_file << endl;
  // Reject out-of-bounds samples
! if (value < this->stats[current_stats].limit_min || value > this->stats[current_stats].limit_max) return;
 
! stats[current_stats].value_count ++;
 
! assert (this->stats[current_stats].bucket_size != 0);
! host_int_8 quantized = (value / this->stats[current_stats].bucket_size) * this->stats[current_stats].bucket_size;
 
! if (quantized < this->stats[current_stats].value_min) this->stats[current_stats].value_min = quantized;
! if (quantized > this->stats[current_stats].value_max) this->stats[current_stats].value_max = quantized;
! this->stats[current_stats].value_hitcount_map [quantized] += ticks;
        }
 
      void accumulate_call (host_int_4 selfpc_low)
--- 275,295 ----
        {
  if (! this->target_component) return;
 
! host_int_8 value = (((host_int_8)this->pc_hi_pin.sense ()) << 32) | (this->pc_pin.sense () & 0xffffffff);
 
  // std::cout << "sampled at 0x" << std::hex << value << std::dec << " for " << stats[current_stats].output_file << endl;
  // Reject out-of-bounds samples
! statistics &st = this->stats[current_stats];
! if (value < st.limit_min || value > st.limit_max) return;
 
! st.value_count ++;
 
! assert (st.bucket_size != 0);
! host_int_8 quantized = (value / st.bucket_size) * st.bucket_size;
 
! if (quantized < st.value_min) st.value_min = quantized;
! if (quantized > st.value_max) st.value_max = quantized;
! st.value_hitcount_map [quantized] += ticks;
        }
 
      void accumulate_call (host_int_4 selfpc_low)
*************** namespace profiling_components
*** 306,326 ****
  host_int_8 callerpc = (((host_int_8)this->cg_caller_hi_pin.sense ()) << 32) | (this->cg_caller_pin.sense () & 0xffffffff);
 
  // Reject out-of-bounds samples
! if (selfpc < this->stats[current_stats].limit_min || selfpc > this->stats[current_stats].limit_max) return;
! if (callerpc < this->stats[current_stats].limit_min || callerpc > this->stats[current_stats].limit_max) return;
!
! stats[current_stats].value_count ++;
!
! assert (this->stats[current_stats].bucket_size != 0);
! host_int_8 c_quantized = (callerpc / this->stats[current_stats].bucket_size) * this->stats[current_stats].bucket_size;
! host_int_8 s_quantized = (selfpc / this->stats[current_stats].bucket_size) * this->stats[current_stats].bucket_size;
!
! if (c_quantized < this->stats[current_stats].value_min) this->stats[current_stats].value_min = c_quantized;
! if (s_quantized < this->stats[current_stats].value_min) this->stats[current_stats].value_min = s_quantized;
! if (c_quantized > this->stats[current_stats].value_max) this->stats[current_stats].value_max = c_quantized;
! if (s_quantized > this->stats[current_stats].value_max) this->stats[current_stats].value_max = s_quantized;
 
! this->stats[current_stats].cg_count_map [make_pair(c_quantized,s_quantized)] ++;
        }
 
 
--- 298,319 ----
  host_int_8 callerpc = (((host_int_8)this->cg_caller_hi_pin.sense ()) << 32) | (this->cg_caller_pin.sense () & 0xffffffff);
 
  // Reject out-of-bounds samples
! statistics &st = this->stats[current_stats];
! if (selfpc < st.limit_min || selfpc > st.limit_max) return;
! if (callerpc < st.limit_min || callerpc > st.limit_max) return;
!
! st.value_count ++;
!
! assert (st.bucket_size != 0);
! host_int_8 c_quantized = (callerpc / st.bucket_size) * st.bucket_size;
! host_int_8 s_quantized = (selfpc / st.bucket_size) * st.bucket_size;
!
! if (c_quantized < st.value_min) st.value_min = c_quantized;
! if (s_quantized < st.value_min) st.value_min = s_quantized;
! if (c_quantized > st.value_max) st.value_max = c_quantized;
! if (s_quantized > st.value_max) st.value_max = s_quantized;
 
! st.cg_count_map [make_pair(c_quantized,s_quantized)] ++;
        }
 
 
*************** namespace profiling_components
*** 563,569 ****
 
    public:
      gprof_component ():
-       target_attribute ("pc"),
        target_component (0),
        output_file_format (endian_unknown),
        accumulate_pin (this, & gprof_component::accumulate),
--- 556,561 ----
*************** namespace profiling_components
*** 577,582 ****
--- 569,578 ----
 
  add_pin ("sample", & this->accumulate_pin);
  add_attribute ("sample", & this->accumulate_pin, "pin");
+ add_pin ("pc", & this->pc_pin);
+ add_attribute ("pc", & this->pc_pin, "pin");
+ add_pin ("pc-hi", & this->pc_hi_pin);
+ add_attribute ("pc-hi", & this->pc_hi_pin, "pin");
  add_pin ("cg-caller", & this->cg_caller_pin);
  add_attribute ("cg-caller", & this->cg_caller_pin, "pin");
  add_pin ("cg-caller-hi", & this->cg_caller_hi_pin);
*************** namespace profiling_components
*** 617,623 ****
        & gprof_component::limit_max_get,
        & gprof_component::limit_max_set,
        "setting");
- add_attribute ("value-attribute", & this->target_attribute, "setting");
  add_attribute_virtual ("output-file", this,
        & gprof_component::output_file_get,
        & gprof_component::output_file_set,
--- 613,618 ----
*************** namespace profiling_components
*** 625,630 ****
--- 620,626 ----
  add_attribute ("output-file-endianness", & this->output_file_format, "setting");
  add_uni_relation ("target-component", & this->target_component);
 
+ pc_hi_pin.driven (0);
  cg_caller_hi_pin.driven (0);
  cg_callee_hi_pin.driven (0);
        }
Index: sid/include/sidcpuutil.h
===================================================================
RCS file: /cvs/src/src/sid/include/sidcpuutil.h,v
retrieving revision 1.37
diff -c -p -r1.37 sidcpuutil.h
*** sid/include/sidcpuutil.h 20 Jun 2006 18:13:45 -0000 1.37
--- sid/include/sidcpuutil.h 26 Jun 2006 19:26:10 -0000
*************** namespace sidutil
*** 253,258 ****
--- 253,260 ----
      output_pin cg_callee_pin;
      output_pin cg_jump_pin;
      output_pin cg_return_pin;
+     output_pin gprof_pc_hi_pin;
+     output_pin gprof_pc_pin;
      output_pin sample_gprof_pin;
     
      // tracing
*************** namespace sidutil
*** 417,422 ****
--- 419,426 ----
        gprof_counter += current_cycle - this->gprof_prev_cycle;
        this->gprof_prev_cycle = current_cycle;
 
+       this->gprof_pc_hi_pin.drive (this->get_pc_hi ());
+       this->gprof_pc_pin.drive (this->get_pc ());
        if (this->gprof_cycles == 0)
  {
   // Sample for gprof in insn-count mode.
*************** namespace sidutil
*** 514,519 ****
--- 518,525 ----
    private:
      callback_pin<basic_cpu> pc_set_pin;
      virtual void set_pc(sid::host_int_4) = 0;
+     virtual sid::host_int_4 get_pc() = 0;
+     virtual sid::host_int_4 get_pc_hi() { return 0; }
      void pc_set_pin_handler(sid::host_int_4 v) { this->set_pc (v); }
 
      // Set the initial endianness after reset
*************** namespace sidutil
*** 860,869 ****
  sid::host_int_4 length  = addr_and_length.second;
 
  // We'll need the current pc.
! std::string pc_str = this->attribute_value ("pc");
! sid::host_int_4 pc;
! sid::component::status rc = sidutil::parse_attribute (pc_str, pc);
! assert (rc == sid::component::ok);
 
  // Just read from insn memory by default
  switch (length)
--- 866,872 ----
  sid::host_int_4 length  = addr_and_length.second;
 
  // We'll need the current pc.
! sid::host_int_4 pc = get_pc ();
 
  // Just read from insn memory by default
  switch (length)
*************** public:
*** 951,956 ****
--- 954,961 ----
  add_pin ("cg-jump", & this->cg_jump_pin);
  add_pin ("print-insn-summary!", & this->print_insn_summary_pin);
  add_pin ("sample-gprof", &sample_gprof_pin);
+ add_pin ("gprof-pc-hi", &gprof_pc_hi_pin);
+ add_pin ("gprof-pc", &gprof_pc_pin);
  add_pin ("endian-set!", & this->endian_set_pin);
  add_pin ("eflags-set!", & this->eflags_set_pin);
  add_watchable_pin ("trap", & this->trap_type_pin); // output side
Index: sid/main/dynamic/commonCfg.cxx
===================================================================
RCS file: /cvs/src/src/sid/main/dynamic/commonCfg.cxx,v
retrieving revision 1.14
diff -c -p -r1.14 commonCfg.cxx
*** sid/main/dynamic/commonCfg.cxx 20 Jun 2006 18:40:20 -0000 1.14
--- sid/main/dynamic/commonCfg.cxx 26 Jun 2006 19:26:10 -0000
*************** GprofCfg::GprofCfg (const string name,
*** 1009,1015 ****
    relate (this, "target-component", cpu);
    conn_pin (cpu, "cg-caller", this, "cg-caller");
    conn_pin (cpu, "cg-callee", this, "cg-callee");
!   set (this, "value-attribute", "pc");
    set (this, "bucket-size", "4"); // bytes-per-bucket
    set (this, "output-file", filename);
  }
--- 1009,1016 ----
    relate (this, "target-component", cpu);
    conn_pin (cpu, "cg-caller", this, "cg-caller");
    conn_pin (cpu, "cg-callee", this, "cg-callee");
!   conn_pin (cpu, "gprof-pc-hi", this, "pc-hi");
!   conn_pin (cpu, "gprof-pc", this, "pc");
    set (this, "bucket-size", "4"); // bytes-per-bucket
    set (this, "output-file", filename);
  }
*************** GprofCfg::GprofCfg (const string name,
*** 1028,1034 ****
 
    sess->shutdown_seq->add_output (7, this, "store");
    relate (this, "target-component", cpu);
-   set (this, "value-attribute", "pc");
    set (this, "bucket-size", "4"); // bytes-per-bucket
  }
 
--- 1029,1034 ----
Reply | Threaded
Open this post in threaded view
|

Re: [patch][rfa] --gprof Performance Improvement

Frank Ch. Eigler
Hi -

> [...]
> Comments? OK to commit?

Nicely done, thanks.  (If backward compatibility was at all a concern,
the gprof component could have a new "sample" pin that assumes the
pin-based PC traffic, instead of changing the current interface.  But
I agree that the old interface is not worth saving.)

Was the string conversion stuff obvious in profiling output?

- FChE
Reply | Threaded
Open this post in threaded view
|

Re: [patch][rfa] --gprof Performance Improvement

Dave Brolley-2
Frank Ch. Eigler wrote:

>Nicely done, thanks.  (If backward compatibility was at all a concern,
>the gprof component could have a new "sample" pin that assumes the
>pin-based PC traffic, instead of changing the current interface.  But
>I agree that the old interface is not worth saving.)
>
>Was the string conversion stuff obvious in profiling output?
>  
>
I don't know --- I took a more brute force approach, since I knew that
the overhead was in the driving of the cpu's sample-gprof pin (the only
effect of using --gprof). I first suspected overhead in the std::map
used to collect the buckets, but that turned out to be small. I then
tried the "local reference for this->stats[current_stats]" optimization
and got the 3%. I then tried experimentally removing the collection of
the data and found no improvement. That left the parsing of the
attributes which, when experimentally removed, accounted for all of the
nasty overhead.

Dave

Reply | Threaded
Open this post in threaded view
|

Re: [patch][rfa] --gprof Performance Improvement

Dave Brolley-2
...committed