Note on choosing string hash functions

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Note on choosing string hash functions

Dmitry Antipov
I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:

r = r * 67 + c - 113

but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:

((hash) * 67 + tolower ((unsigned char) (c)) - 113)

Everyone assumes that 'tolower' is simple, fast, and usually implemented
as a macro. But when >1M of mangled C++ symbols should be hashed, results
may be somewhat surprising ('perf report'):

      8.94%  gdb      libc-2.25.so              [.] tolower
      8.45%  gdb      gdb                       [.] d_print_comp_inner
      5.29%  gdb      gdb                       [.] htab_hash_string
      4.16%  gdb      gdb                       [.] minimal_symbol_reader::install
      3.72%  gdb      libc-2.25.so              [.] __memmove_sse2_unaligned_erms
      3.63%  gdb      gdb                       [.] msymbol_hash_iw
      3.24%  gdb      gdb                       [.] htab_find_slot
      3.07%  gdb      gdb                       [.] rust_is_mangled
      2.57%  gdb      gdb                       [.] eq_demangled_name_entry
      2.42%  gdb      gdb                       [.] d_print_comp
      2.36%  gdb      gdb                       [.] compare_minimal_symbols
      2.32%  gdb      gdb                       [.] default_search_name_hash
      2.20%  gdb      gdb                       [.] skip_spaces
      2.13%  gdb      gdb                       [.] rust_demangle_sym
      2.02%  gdb      gdb                       [.] hash_demangled_name_entry
      2.00%  gdb      gdb                       [.] tolower@plt
      1.90%  gdb      gdb                       [.] d_count_templates_scopes
      1.89%  gdb      gdb                       [.] ada_decode
      1.87%  gdb      libc-2.25.so              [.] msort_with_tmp.part.0
      1.48%  gdb      libc-2.25.so              [.] strlen
      1.46%  gdb      gdb                       [.] htab_expand
      1.39%  gdb      [kernel.kallsyms]         [.] native_irq_return_iret
      1.34%  gdb      libc-2.25.so              [.] __strchr_sse2
      1.27%  gdb      gdb                       [.] d_name
      1.17%  gdb      gdb                       [.] cplus_demangle_type
      1.11%  gdb      libc-2.25.so              [.] _int_malloc
      ...

Dropping the 'tolower' from SYMBOL_HASH_NEXT immediately shows:

      9.26%  gdb      gdb                       [.] d_print_comp_inner
      6.97%  gdb      gdb                       [.] minimal_symbol_reader::install
      5.80%  gdb      gdb                       [.] htab_hash_string
      3.59%  gdb      libc-2.25.so              [.] __memmove_sse2_unaligned_erms
      3.41%  gdb      gdb                       [.] htab_find_slot
      3.37%  gdb      gdb                       [.] rust_is_mangled
      3.13%  gdb      libc-2.25.so              [.] isspace
      2.91%  gdb      gdb                       [.] eq_demangled_name_entry
      2.75%  gdb      gdb                       [.] d_print_comp
      2.56%  gdb      gdb                       [.] compare_minimal_symbols
      2.42%  gdb      gdb                       [.] msymbol_hash_iw
      2.37%  gdb      gdb                       [.] skip_spaces
      2.33%  gdb      gdb                       [.] default_search_name_hash
      2.32%  gdb      gdb                       [.] rust_demangle_sym
      2.14%  gdb      gdb                       [.] ada_decode
      2.14%  gdb      gdb                       [.] d_count_templates_scopes
      2.05%  gdb      gdb                       [.] hash_demangled_name_entry
      1.97%  gdb      libc-2.25.so              [.] msort_with_tmp.part.0
      1.72%  gdb      libc-2.25.so              [.] strlen
      1.61%  gdb      [kernel.kallsyms]         [.] native_irq_return_iret
      1.56%  gdb      gdb                       [.] htab_expand
      1.55%  gdb      libc-2.25.so              [.] __strchr_sse2
      1.45%  gdb      gdb                       [.] d_name
      1.41%  gdb      libc-2.25.so              [.] _int_malloc
      1.28%  gdb      gdb                       [.] cplus_demangle_type
      1.03%  gdb      gdb                       [.] minimal_symbol_reader::record_full
      1.02%  gdb      gdb                       [.] internal_cplus_demangle
      1.01%  gdb      gdb                       [.] elf_symtab_read
      ...

Observed on current binutils-gdb trunk, glibc 2.25, gcc 7.2.1, -g -O3 -march=native -mtune=native,
"remote" (both gdb and gdbserver on the same system) debugging Firefox debug build, reports was generated
with 'perf record -F 10000 gdb -batch --readnever -q -ex "set sysroot /" -ex "set pagination off" \
-ex "target extended-remote :5555" -ex "thread apply all bt" -ex "quit"'.

Dmitry
Reply | Threaded
Open this post in threaded view
|

Re: Note on choosing string hash functions

Pedro Alves-7
On 11/17/2017 09:48 AM, Dmitry Antipov wrote:

> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:
>
> r = r * 67 + c - 113
>
> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:
>
> ((hash) * 67 + tolower ((unsigned char) (c)) - 113)
>
> Everyone assumes that 'tolower' is simple, fast, and usually implemented
> as a macro. But when >1M of mangled C++ symbols should be hashed, results
> may be somewhat surprising ('perf report'):

I've noticed tolower show up in perf too, along with isspace,
in strncmp_iw/strncmp_iw_with_mode and friends.

I think the tolower is there for source languages that
are case insensitive, like Ada and Fortran.  (Or if you do
"set case-sensitive off", but I think that's a
misfeature...).

I haven't explicitly tried measuring perf with this patch:

 https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html

but I think it should improve performance significantly,
since safe-ctype.h is locale independent.

(
Of course, this ignores non-ASCII code points, but I'm not
sure we really handle non-ASCII correctly everywhere else,
i.e., whether it makes a difference today.  Certainly the
locale when the program was compiled generally has no relation
to the current user's locale, even though they frequently
match.  If it does make a difference, one way to handle
it that may be still cheap is to hash all non-ASCII (and utf8
multi-byte sequences) to the same, while still doing proper
lowercase comparisons when actually comparing symbols that end
up in the bucket.  Something like:

#define SYMBOL_HASH_NEXT(hash, c) \
  ((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113)

I'd assume that most symbol names are wholly or at least mostly
ASCII and that that wouldn't result in too many collisions.

But then again, I don't really know how compilers for case-insensitive
languages handle non-ASCII symbol names in different cases, especially
considering multibyte sequences.
)

>
> Observed on current binutils-gdb trunk, glibc 2.25, gcc 7.2.1, -g -O3
> -march=native -mtune=native,
> "remote" (both gdb and gdbserver on the same system) debugging Firefox
> debug build, reports was generated
> with 'perf record -F 10000 gdb -batch --readnever -q -ex "set sysroot /"
> -ex "set pagination off" \
> -ex "target extended-remote :5555" -ex "thread apply all bt" -ex "quit"'.

Thanks,
Pedro Alves

Reply | Threaded
Open this post in threaded view
|

Re: Note on choosing string hash functions

Pedro Alves-7
On 11/17/2017 01:42 PM, Pedro Alves wrote:

> On 11/17/2017 09:48 AM, Dmitry Antipov wrote:
>> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:
>>
>> r = r * 67 + c - 113
>>
>> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:
>>
>> ((hash) * 67 + tolower ((unsigned char) (c)) - 113)
>>
>> Everyone assumes that 'tolower' is simple, fast, and usually implemented
>> as a macro. But when >1M of mangled C++ symbols should be hashed, results
>> may be somewhat surprising ('perf report'):
>
> I've noticed tolower show up in perf too, along with isspace,
> in strncmp_iw/strncmp_iw_with_mode and friends.
>
> I think the tolower is there for source languages that
> are case insensitive, like Ada and Fortran.  (Or if you do
> "set case-sensitive off", but I think that's a
> misfeature...).
>
> I haven't explicitly tried measuring perf with this patch:
>
>  https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html
>
> but I think it should improve performance significantly,
> since safe-ctype.h is locale independent.

I looked at performance now, and while there's an improvement,
it isn't nearly as significant as I'd hope.

> (
> Of course, this ignores non-ASCII code points, but I'm not
> sure we really handle non-ASCII correctly everywhere else,
> i.e., whether it makes a difference today.  Certainly the
> locale when the program was compiled generally has no relation
> to the current user's locale, even though they frequently
> match.  If it does make a difference, one way to handle
> it that may be still cheap is to hash all non-ASCII (and utf8
> multi-byte sequences) to the same, while still doing proper
> lowercase comparisons when actually comparing symbols that end
> up in the bucket.  Something like:
>
> #define SYMBOL_HASH_NEXT(hash, c) \
>   ((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113)
>
> I'd assume that most symbol names are wholly or at least mostly
> ASCII and that that wouldn't result in too many collisions.
>
> But then again, I don't really know how compilers for case-insensitive
> languages handle non-ASCII symbol names in different cases, especially
> considering multibyte sequences.

I'm looking at merging the rest of that series to master, and
so I've been looking at this issue today, mostly to convince myself
that the tolower -> TOLOWER change can't cause a regression.

It could cause a difference for case-insensitive languages,
if e.g., Latin-1 encoded symbol names could appear in the minimal
symbol tables, _and_ GDB is running with a Latin-1 locale too.
In that case, with tolower, we'd hash "funcá" / "funcÁ" to the same,
but not with TOLOWER, which is strictly ASCII.
I'm pretty convinced that scenario is not real.

First, nowadays UTF-8 locales are pretty much the norm in most
distros (over other ASCII-superset charsets), and tolower on the
individual bytes does nothing on UTF-8.

Then, I played with making Ada/gnat and both Latin-1 and UTF-8 sources
files (the latter with "pragma Wide_Character_Encoding (UTF8)"), and
what I discovered was that Ada's encoding/mangling guarantees that only
ASCII characters end up in mangled names.  From gcc/ada/namet.ads:

~~~
--    Identifiers        Stored with upper case letters folded to lower case.
--                       Upper half (16#80# bit set) and wide characters are
--                       stored in an encoded form (Uhh for upper half char,
--                       Whhhh for wide characters, WWhhhhhhhh as provided by
--                       the routine Append_Encoded, where hh are hex
--                       digits for the character code using lower case a-f).
--                       Normally the use of U or W in other internal names is
--                       avoided, but these letters may be used in internal
--                       names (without this special meaning), if they appear
--                       as the last character of the name, or they are
--                       followed by an upper case letter (other than the WW
--                       sequence), or an underscore.
~~~

Funny enough, GDB doesn't grok this Uhh/WWhhhhhhhh encoding today.
(I wrote a quick patch to teach GDB about it, to help convince myself,
though as is, it only works when gdb's charset/locale is UTF-8.)

Then I looked at Fortran, but I couldn't make gfortran use/grok
non-ASCII identifiers.  Looking at
  https://gcc.gnu.org/onlinedocs/gfortran/SELECTED_005fCHAR_005fKIND.html
and
  https://github.com/jacobwilliams/json-fortran/wiki/Unicode-How-To

I believe that's just not supported.  So I moved on.

Then there's C/C++.  But even here both GCC and Clang
only allow UTF-8 in identifiers (GCC via UCNs, and Clang because
that's the only charset that it supports), and since
GCC uses UTF-8 internally, I believe that that's the encoding it
always uses for mangling.  That's what I see in my experiments.
I tried --input-charset=ISO-8859-1 --exec-charset=ISO-8859-1
with a Latin-1 source file, and I made sure I had a non-ASCII
identifier as well as a Latin-1 string literal in there, and then confirmed
by debugging the result with gdb that the string literal really was
Latin-1.  Still, the mangled identifier name is encoded in UTF-8
(confirmed with hexdump).  AFAICT, I always see UTF-8 in strings in DWARF
too (and the DWARF spec recommends this).  So tolower/TOLOWER can't make
a difference here either.    That's good, it means the ABI/mangling doesn't
depend on whatever's the host charset, and neither does the DWARF.  It's
possible other compilers/mangling schemes/ABIs do something else, but
at this point I'm not aware of any, but I hope that that's nothing we'd
need to support.

I think Rust is UTF-8 always.  I'd assume Go too, given
its authorship/ancestry...

In the performance aspect, there's a little bit of
improvement.  With this:

$ cat ff.cmd
set pagination off

define ff
  attach 19018 # Firefox's PID.
  thread apply all bt
  detach
  file
  end

set $count = 10
while $count != 0
  ff
  set $count = $count - 1
  end

$ time $g -q --batch --readnever -x ff.cmd

I observe between 3% / 10% time drop.

(I used Joel's version of --readnever from here:
 https://sourceware.org/ml/gdb-patches/2016-07/msg00103.html)

So in sum, I'm pretty convinced the patch is safe as is.

Thanks,
Pedro Alves

Reply | Threaded
Open this post in threaded view
|

Re: Note on choosing string hash functions

Pedro Alves-7
On 11/22/2017 02:10 AM, Pedro Alves wrote:

> On 11/17/2017 01:42 PM, Pedro Alves wrote:
>
> Then, I played with making Ada/gnat and both Latin-1 and UTF-8 sources
> files (the latter with "pragma Wide_Character_Encoding (UTF8)"), and
> what I discovered was that Ada's encoding/mangling guarantees that only
> ASCII characters end up in mangled names.  From gcc/ada/namet.ads:
>
> ~~~
> --    Identifiers        Stored with upper case letters folded to lower case.
> --                       Upper half (16#80# bit set) and wide characters are
> --                       stored in an encoded form (Uhh for upper half char,
> --                       Whhhh for wide characters, WWhhhhhhhh as provided by
> --                       the routine Append_Encoded, where hh are hex
> --                       digits for the character code using lower case a-f).
> --                       Normally the use of U or W in other internal names is
> --                       avoided, but these letters may be used in internal
> --                       names (without this special meaning), if they appear
> --                       as the last character of the name, or they are
> --                       followed by an upper case letter (other than the WW
> --                       sequence), or an underscore.
> ~~~
>
> Funny enough, GDB doesn't grok this Uhh/WWhhhhhhhh encoding today.
> (I wrote a quick patch to teach GDB about it, to help convince myself,
> though as is, it only works when gdb's charset/locale is UTF-8.)

For the record, here's what that patch looks like.

From 710bde831ed78641e175046e0711a35d5061d7ee Mon Sep 17 00:00:00 2001
From: Pedro Alves <[hidden email]>
Date: Tue, 21 Nov 2017 20:05:42 +0000
Subject: [PATCH] Ada: Support Uhh encoding, UTF-8

An attempt at checking whether TOLOWER for minsyms makes a difference
over tolower...

It doesn't, Ada's encoding encodes "upper half char"s using Uff, so
non-ASCII characters don't appear in the mangled names...

The Ada lexer change is necessary so that it's possible to input UTF-8
in expressions.

This assumes the host encoding is UTF-8 as is...  I wonder... maybe
GDB should always use UTF-8 internally, and translate host-encoding ->
UTF-8 at the readline -> GDB boundary.

Yes, the test passes.  :-)
---
 gdb/ada-lang.c                     | 30 +++++++++++++++++++++
 gdb/ada-lex.l                      |  2 +-
 gdb/common/rsp-low.c               |  2 +-
 gdb/common/rsp-low.h               |  4 +++
 gdb/testsuite/gdb.ada/utf8.exp     | 53 ++++++++++++++++++++++++++++++++++++++
 gdb/testsuite/gdb.ada/utf8/foo.adb | 25 ++++++++++++++++++
 gdb/testsuite/gdb.ada/utf8/pck.adb | 26 +++++++++++++++++++
 gdb/testsuite/gdb.ada/utf8/pck.ads | 22 ++++++++++++++++
 8 files changed, 162 insertions(+), 2 deletions(-)
 create mode 100644 gdb/testsuite/gdb.ada/utf8.exp
 create mode 100644 gdb/testsuite/gdb.ada/utf8/foo.adb
 create mode 100644 gdb/testsuite/gdb.ada/utf8/pck.adb
 create mode 100644 gdb/testsuite/gdb.ada/utf8/pck.ads

diff --git a/gdb/ada-lang.c b/gdb/ada-lang.c
index 33c4e8e..d0fb06d 100644
--- a/gdb/ada-lang.c
+++ b/gdb/ada-lang.c
@@ -63,6 +63,7 @@
 #include "common/function-view.h"
 #include "common/byte-vector.h"
 #include <algorithm>
+#include "common/rsp-low.h"
 
 /* Define whether or not the C operator '/' truncates towards zero for
    differently signed operands (truncation direction is undefined in C).
@@ -1007,6 +1008,19 @@ ada_encode_1 (const char *decoded, bool throw_errors)
           encoding_buffer[k] = encoding_buffer[k + 1] = '_';
           k += 2;
         }
+      else if (((unsigned char) *p & 0xe0) == 0xc0)
+ {
+  /* "Uhh" Ada encoding -> UTF-8 character.  */
+
+  unsigned char c1 = p[0];
+  unsigned char c2 = p[1];
+  unsigned char c = (c1 << 6) | (c2 & (0xff >> 2));
+  p += 1;
+
+  encoding_buffer[k] = 'U';
+  pack_hex_byte (&encoding_buffer[k + 1], c);
+  k += 3;
+ }
       else if (*p == '"')
         {
           const struct ada_opname_map *mapping;
@@ -1355,6 +1369,8 @@ ada_decode (const char *encoded)
             i++;
         }
 
+      std::pair<int, int> nibbles;
+
       if (encoded[i] == 'X' && i != 0 && isalnum (encoded[i - 1]))
         {
           /* This is a X[bn]* sequence not separated from the previous
@@ -1378,6 +1394,20 @@ ada_decode (const char *encoded)
           i += 2;
           j += 1;
         }
+      else if (len0 - i > 3
+       && encoded[i] == 'U'
+       && ishex (encoded[i + 1], &nibbles.first)
+       && ishex (encoded[i + 2], &nibbles.second))
+ {
+  /* Convert Ada upper half char encoding to UTF-8 character
+     (2 bytes code point).  */
+  unsigned char c = nibbles.first << 4 | nibbles.second;
+
+  decoded[j] = 0xc0 | c >> 6;
+  decoded[j + 1] = 0x80 | (c & 0x03f);
+  i += 3;
+  j += 2;
+ }
       else
         {
           /* It's a character part of the decoded name, so just copy it
diff --git a/gdb/ada-lex.l b/gdb/ada-lex.l
index 63137bd..41b0582 100644
--- a/gdb/ada-lex.l
+++ b/gdb/ada-lex.l
@@ -29,7 +29,7 @@ NUM10 ({DIG}({DIG}|_)*)
 HEXDIG [0-9a-f]
 NUM16 ({HEXDIG}({HEXDIG}|_)*)
 OCTDIG [0-7]
-LETTER [a-z_]
+LETTER [a-z_\x80-\xff]
 ID ({LETTER}({LETTER}|{DIG})*|"<"{LETTER}({LETTER}|{DIG})*">")
 WHITE [ \t\n]
 TICK ("'"{WHITE}*)
diff --git a/gdb/common/rsp-low.c b/gdb/common/rsp-low.c
index 85987f7..3209693 100644
--- a/gdb/common/rsp-low.c
+++ b/gdb/common/rsp-low.c
@@ -50,7 +50,7 @@ tohex (int nib)
 
 static const char hexchars[] = "0123456789abcdef";
 
-static int
+int
 ishex (int ch, int *val)
 {
   if ((ch >= 'a') && (ch <= 'f'))
diff --git a/gdb/common/rsp-low.h b/gdb/common/rsp-low.h
index 99dc93f..947ee20 100644
--- a/gdb/common/rsp-low.h
+++ b/gdb/common/rsp-low.h
@@ -20,6 +20,10 @@
 #ifndef COMMON_RSP_LOW_H
 #define COMMON_RSP_LOW_H
 
+/* FIXME: comment.  */
+
+extern int ishex (int ch, int *val);
+
 /* Convert hex digit A to a number, or throw an exception.  */
 
 extern int fromhex (int a);
diff --git a/gdb/testsuite/gdb.ada/utf8.exp b/gdb/testsuite/gdb.ada/utf8.exp
new file mode 100644
index 0000000..4e5fc01
--- /dev/null
+++ b/gdb/testsuite/gdb.ada/utf8.exp
@@ -0,0 +1,53 @@
+# -*-mode: tcl; coding: utf-8;-*-
+#
+# Copyright 2017 Free Software Foundation, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# Test GDB's support for symbols with UTF-8 multi-byte symbol names.
+
+# Actually, we're only testing "Uff" (Latin1 page) encoded names,
+# i.e., upper half char characters.  Wider characters have a different
+# Ada encoding which we don't support yet.
+
+load_lib "ada.exp"
+
+# Enable basic use of UTF-8.  This is restored automatically for every
+# testcase.
+setenv LC_ALL C.UTF-8
+
+standard_ada_testfile foo
+
+if {[gdb_compile_ada "${srcfile}" "${binfile}" executable {debug}] != "" } {
+  return -1
+}
+
+clean_restart ${testfile}
+
+if ![runto_main] then {
+  perror "Couldn't run ${testfile}"
+  return
+}
+
+# Check printing an expression involving an UTF8 symbol name.
+gdb_test "print &pck.funcáx" \
+         " = \\(access function \\(a1: integer\\) return integer\\) $hex <pck.funcáx>"
+
+# Check setting a breakpoint in a function with an UTF8 symbol name.
+gdb_test "b pck.funcáx" "Breakpoint $decimal .*"
+
+# Test running to the breakpoint, confirm GDB prints the function name
+# correctly.
+gdb_test "continue" "Breakpoint $decimal, pck.funcáx \\(i=1\\).*"
+
diff --git a/gdb/testsuite/gdb.ada/utf8/foo.adb b/gdb/testsuite/gdb.ada/utf8/foo.adb
new file mode 100644
index 0000000..f49ab49
--- /dev/null
+++ b/gdb/testsuite/gdb.ada/utf8/foo.adb
@@ -0,0 +1,25 @@
+-- -*-mode: Ada; coding: utf-8;-*-
+
+--  Copyright 2017 Free Software Foundation, Inc.
+--
+--  This program is free software; you can redistribute it and/or modify
+--  it under the terms of the GNU General Public License as published by
+--  the Free Software Foundation; either version 3 of the License, or
+--  (at your option) any later version.
+--
+--  This program is distributed in the hope that it will be useful,
+--  but WITHOUT ANY WARRANTY; without even the implied warranty of
+--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+--  GNU General Public License for more details.
+--
+--  You should have received a copy of the GNU General Public License
+--  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+pragma Wide_Character_Encoding (UTF8);
+
+with Pck; use Pck;
+procedure Foo is
+   I : Integer := 1;
+begin
+   FuncÁx (I);
+end Foo;
diff --git a/gdb/testsuite/gdb.ada/utf8/pck.adb b/gdb/testsuite/gdb.ada/utf8/pck.adb
new file mode 100644
index 0000000..a4a4962
--- /dev/null
+++ b/gdb/testsuite/gdb.ada/utf8/pck.adb
@@ -0,0 +1,26 @@
+-- -*-mode: Ada; coding: utf-8;-*-
+
+--  Copyright 2017 Free Software Foundation, Inc.
+--
+--  This program is free software; you can redistribute it and/or modify
+--  it under the terms of the GNU General Public License as published by
+--  the Free Software Foundation; either version 3 of the License, or
+--  (at your option) any later version.
+--
+--  This program is distributed in the hope that it will be useful,
+--  but WITHOUT ANY WARRANTY; without even the implied warranty of
+--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+--  GNU General Public License for more details.
+--
+--  You should have received a copy of the GNU General Public License
+--  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+pragma Wide_Character_Encoding (UTF8);
+
+package body Pck is
+   procedure FuncÁx (I: in out Integer) is
+   begin
+      I := I + 1;
+   end FuncÁx;
+
+end Pck;
diff --git a/gdb/testsuite/gdb.ada/utf8/pck.ads b/gdb/testsuite/gdb.ada/utf8/pck.ads
new file mode 100644
index 0000000..3978ba4
--- /dev/null
+++ b/gdb/testsuite/gdb.ada/utf8/pck.ads
@@ -0,0 +1,22 @@
+-- -*-mode: Ada; coding: utf-8;-*-
+
+--  Copyright 2017 Free Software Foundation, Inc.
+--
+--  This program is free software; you can redistribute it and/or modify
+--  it under the terms of the GNU General Public License as published by
+--  the Free Software Foundation; either version 3 of the License, or
+--  (at your option) any later version.
+--
+--  This program is distributed in the hope that it will be useful,
+--  but WITHOUT ANY WARRANTY; without even the implied warranty of
+--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+--  GNU General Public License for more details.
+--
+--  You should have received a copy of the GNU General Public License
+--  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+pragma Wide_Character_Encoding (UTF8);
+
+package Pck is
+   procedure FuncÁx (I: in out Integer);
+end Pck;
--
2.5.5


Reply | Threaded
Open this post in threaded view
|

Re: Note on choosing string hash functions

Dmitry Antipov
In reply to this post by Pedro Alves-7
On 11/22/2017 05:10 AM, Pedro Alves wrote:

> I observe between 3% / 10% time drop.
>
> (I used Joel's version of --readnever from here:
>   https://sourceware.org/ml/gdb-patches/2016-07/msg00103.html)
>
> So in sum, I'm pretty convinced the patch is safe as is.

BTW, shouldn't we use safe-ctype.h macros through gdb/utils.c as well?
With this, there is a substantial difference between:

      9.83%  gdb      gdb                       [.] find_pc_sect_psymtab
      6.54%  gdb      gdb                       [.] bcache_full
      5.38%  gdb      libc-2.25.so              [.] tolower                             ; hmm...
      4.48%  gdb      gdb                       [.] htab_find_slot_with_hash
      4.39%  gdb      gdb                       [.] htab_hash_string
      3.88%  gdb      gdb                       [.] load_partial_dies
      3.61%  gdb      gdb                       [.] strcmp_iw_ordered
      3.45%  gdb      gdb                       [.] read_indirect_string_at_offset_from
      3.18%  gdb      gdb                       [.] cpname_parse
      3.07%  gdb      gdb                       [.] lookup_minimal_symbol_by_pc_name
      2.70%  gdb      gdb                       [.] read_attribute_value
      2.68%  gdb      libc-2.25.so              [.] __strcmp_sse2_unaligned
      2.47%  gdb      gdb                       [.] cpname_lex
      2.44%  gdb      gdb                       [.] peek_die_abbrev
      2.32%  gdb      libc-2.25.so              [.] _int_malloc
      2.15%  gdb      gdb                       [.] d_print_comp_inner
      1.80%  gdb      libc-2.25.so              [.] isspace                             ; hmm...
      1.47%  gdb      gdb                       [.] cp_canonicalize_string[abi:cxx11]
      1.45%  gdb      gdb                       [.] eq_demangled_name_entry
      1.39%  gdb      libc-2.25.so              [.] malloc_consolidate.part.1

and:

     10.97%  gdb      gdb                       [.] find_pc_sect_psymtab
      7.25%  gdb      gdb                       [.] bcache_full
      4.82%  gdb      gdb                       [.] htab_hash_string
      4.65%  gdb      gdb                       [.] htab_find_slot_with_hash
      4.28%  gdb      gdb                       [.] load_partial_dies
      3.69%  gdb      gdb                       [.] read_indirect_string_at_offset_from
      3.45%  gdb      gdb                       [.] cpname_parse
      3.37%  gdb      gdb                       [.] lookup_minimal_symbol_by_pc_name
      3.03%  gdb      gdb                       [.] read_attribute_value
      3.01%  gdb      gdb                       [.] strcmp_iw_ordered
      2.99%  gdb      libc-2.25.so              [.] __strcmp_sse2_unaligned
      2.81%  gdb      gdb                       [.] peek_die_abbrev
      2.74%  gdb      gdb                       [.] cpname_lex
      2.57%  gdb      libc-2.25.so              [.] _int_malloc
      2.40%  gdb      gdb                       [.] d_print_comp_inner
      1.59%  gdb      gdb                       [.] eq_demangled_name_entry
      1.56%  gdb      libc-2.25.so              [.] malloc_consolidate.part.1
      1.56%  gdb      gdb                       [.] cp_canonicalize_string[abi:cxx11]
      1.00%  gdb      libc-2.25.so              [.] strlen

IIUC this is because strcmp_iw_ordered() is quite critical when building partial symbol tables.

Dmitry
Reply | Threaded
Open this post in threaded view
|

Re: Note on choosing string hash functions

Pedro Alves-7
On 11/28/2017 10:35 AM, Dmitry Antipov wrote:

> On 11/22/2017 05:10 AM, Pedro Alves wrote:
>
>> I observe between 3% / 10% time drop.
>>
>> (I used Joel's version of --readnever from here:
>>   https://sourceware.org/ml/gdb-patches/2016-07/msg00103.html)
>>
>> So in sum, I'm pretty convinced the patch is safe as is.
>
> BTW, shouldn't we use safe-ctype.h macros through gdb/utils.c as well?
> With this, there is a substantial difference between:
>
>      9.83%  gdb      gdb                       [.] find_pc_sect_psymtab
>      6.54%  gdb      gdb                       [.] bcache_full
>      5.38%  gdb      libc-2.25.so              [.]
> tolower                             ; hmm...
>      4.48%  gdb      gdb                       [.] htab_find_slot_with_hash
>      4.39%  gdb      gdb                       [.] htab_hash_string
>      3.88%  gdb      gdb                       [.] load_partial_dies
>      3.61%  gdb      gdb                       [.] strcmp_iw_ordered
>      3.45%  gdb      gdb                       [.]
> read_indirect_string_at_offset_from
>      3.18%  gdb      gdb                       [.] cpname_parse
>      3.07%  gdb      gdb                       [.]
> lookup_minimal_symbol_by_pc_name
>      2.70%  gdb      gdb                       [.] read_attribute_value
>      2.68%  gdb      libc-2.25.so              [.] __strcmp_sse2_unaligned
>      2.47%  gdb      gdb                       [.] cpname_lex
>      2.44%  gdb      gdb                       [.] peek_die_abbrev
>      2.32%  gdb      libc-2.25.so              [.] _int_malloc
>      2.15%  gdb      gdb                       [.] d_print_comp_inner
>      1.80%  gdb      libc-2.25.so              [.]
> isspace                             ; hmm...
>      1.47%  gdb      gdb                       [.]
> cp_canonicalize_string[abi:cxx11]
>      1.45%  gdb      gdb                       [.] eq_demangled_name_entry
>      1.39%  gdb      libc-2.25.so              [.]
> malloc_consolidate.part.1
>
> and:
>
>     10.97%  gdb      gdb                       [.] find_pc_sect_psymtab
>      7.25%  gdb      gdb                       [.] bcache_full
>      4.82%  gdb      gdb                       [.] htab_hash_string
>      4.65%  gdb      gdb                       [.] htab_find_slot_with_hash
>      4.28%  gdb      gdb                       [.] load_partial_dies
>      3.69%  gdb      gdb                       [.]
> read_indirect_string_at_offset_from
>      3.45%  gdb      gdb                       [.] cpname_parse
>      3.37%  gdb      gdb                       [.]
> lookup_minimal_symbol_by_pc_name
>      3.03%  gdb      gdb                       [.] read_attribute_value
>      3.01%  gdb      gdb                       [.] strcmp_iw_ordered
>      2.99%  gdb      libc-2.25.so              [.] __strcmp_sse2_unaligned
>      2.81%  gdb      gdb                       [.] peek_die_abbrev
>      2.74%  gdb      gdb                       [.] cpname_lex
>      2.57%  gdb      libc-2.25.so              [.] _int_malloc
>      2.40%  gdb      gdb                       [.] d_print_comp_inner
>      1.59%  gdb      gdb                       [.] eq_demangled_name_entry
>      1.56%  gdb      libc-2.25.so              [.]
> malloc_consolidate.part.1
>      1.56%  gdb      gdb                       [.]
> cp_canonicalize_string[abi:cxx11]
>      1.00%  gdb      libc-2.25.so              [.] strlen
>
> IIUC this is because strcmp_iw_ordered() is quite critical when building
> partial symbol tables.

Yes, I think so.

BTW, with --readnever, I noticed htab_hash_string at the top,
so I experimented with unrolling it, as in the patch below.  This
seems to make htab_hash_string/my_htab_hash_string disappear from the
perf log, and using the same ff.cmd script I shown before, I get:

  0m20.0s -> 0m18.9

I haven't really investigated whether this increases memory
use significantly.  It may be that the extra cost for the string
length is masked by malloc's round-up overhead already,
resulting in no overhead in practice.

Another thing that might be worth it to look at is to see if
we could determine a better initial size for the hash table
depending on number of elf symbols, to avoid too many rehashes.

'ada_decode' seems to be an unfortunate bottleneck that shows up
on top (with --readnever), coming from ada_sniff_from_mangled_name, trying
to sniff each minsym language from the mangled name, which is unfortunate
because that's a lot of work when no symbol turns out to be an Ada symbol...
I'd be good to see about making ada_decode's not-an-ada-symbol path
be cheaper.

An idea I had would be to try to combine most the language_sniff_from_mangled_name
implementations in one call, by deferring to cplus_demangle which seems
like could take care of gnu v3, rust, java, gnat, dlang and old C++ mangling
schemes all at once.  This would avoid the overhead of gdb_demangle and
bfd_demangle for each language-specific demangling call, at least.  Not sure.

And then we just spend a lot of in the C++ demangler, which
is kind of expected since there are a lot of C++ symbols to demangle.
Not sure there's much we can do in the demangler itself.  Maybe
avoiding malloc, and seeing about avoiding multiple passes over the
input string would help.  But ultimately, I think that the major win
would probably come from parallelizing minsym demangling.  It's kind of
silly that we don't take advantage of multiple cores here...

Anyway, this is just a brain dump from a little investigation I did this past
weekend, and I think that this area has a lot of scope for improvements, but
I won't be able to follow up on any of this myself in the next following
weeks at least, with several gdb 8.1 issues on my plate.

From 0567f3835cbe743ec9d294321ab1002039a3016f Mon Sep 17 00:00:00 2001
From: Pedro Alves <[hidden email]>
Date: Tue, 28 Nov 2017 11:17:32 +0000
Subject: [PATCH] unroll htab_hash_string

---
 gdb/symtab.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 81 insertions(+), 11 deletions(-)

diff --git a/gdb/symtab.c b/gdb/symtab.c
index 3d59367..c66874f 100644
--- a/gdb/symtab.c
+++ b/gdb/symtab.c
@@ -667,12 +667,77 @@ symbol_set_language (struct general_symbol_info *gsymbol,
     }
 }
 
+/* Hash P/LEN.  Unrolled version of
+   libiberty.c:hashtab.c:htab_hash_string, which is a string hash
+   function good for identifiers.  Should probably be moved to
+   hashtab.c itself.  */
+
+hashval_t
+my_htab_hash_string (const PTR p, int len)
+{
+  const unsigned char *string = (const unsigned char *) p;
+  unsigned int hash = 0;
+
+  int i = 0;
+
+#if 0
+  /* step 1 */
+  for (; i + 3 < len; i += 4)
+    {
+      unsigned int h1 = (hash) * 67 + string[i + 0] - 113;
+      unsigned int h2 = (has1) * 67 + string[i + 1] - 113;
+      unsigned int h3 = (has2) * 67 + string[i + 2] - 113;
+ hash = (has3) * 67 + string[i + 3] - 113;
+    }
+#endif
+
+#if 0
+  /* step 2 */
+  for (; i + 3 < len; i += 4)
+    {
+      unsigned int h2 = 67 * 67 * hash + string[i + 0] * 67 + string[i + 1] - 113 - 113 * 67;
+      unsigned int h3 = (has2) * 67 + string[i + 2] - 113;
+ hash = (has3) * 67 + string[i + 3] - 113;
+    }
+#endif
+
+#if 0
+  /* step 3 */
+  for (; i + 3 < len; i += 4)
+    {
+      unsigned int h3 = 67 * 67 * 67 * hash + string[i + 0] * 67 * 67 + string[i + 1] * 67 + string[i + 2] - 113 - 113 * 67 - 113 * 67 * 67;
+ hash = (h3) * 67 + string[i + 3] - 113;
+    }
+#endif
+
+  /* Unrolled a few times to compensate for multiplication latency due
+     to data dependencies.  */
+  for (; i + 3 < len; i += 4)
+    {
+      hash = (67 * 67 * 67 * 67 * hash
+      + 67 * 67 * 67 * string[i + 0]
+      + 67 * 67 * string[i + 1]
+      + 67 * string[i + 2]
+      + string[i + 3]
+      - 113 * 67
+      - 113 * 67 * 67
+      - 113 * 67 * 67 * 67
+      - 113);
+    }
+
+  for (; i < len; i++)
+    hash = hash * 67 + string[i] - 113;
+
+  return hash;
+}
+
 /* Functions to initialize a symbol's mangled name.  */
 
 /* Objects of this type are stored in the demangled name hash table.  */
 struct demangled_name_entry
 {
   const char *mangled;
+  uint32_t mangled_len;
   char demangled[1];
 };
 
@@ -684,7 +749,7 @@ hash_demangled_name_entry (const void *data)
   const struct demangled_name_entry *e
     = (const struct demangled_name_entry *) data;
 
-  return htab_hash_string (e->mangled);
+  return my_htab_hash_string (e->mangled, e->mangled_len);
 }
 
 /* Equality function for the demangled name hash.  */
@@ -697,6 +762,8 @@ eq_demangled_name_entry (const void *a, const void *b)
   const struct demangled_name_entry *db
     = (const struct demangled_name_entry *) b;
 
+  if (da->mangled_len != db->mangled_len)
+    return 0;
   return strcmp (da->mangled, db->mangled) == 0;
 }
 
@@ -770,8 +837,8 @@ symbol_find_demangled_name (struct general_symbol_info *gsymbol,
 
 void
 symbol_set_names (struct general_symbol_info *gsymbol,
-  const char *linkage_name, int len, int copy_name,
-  struct objfile *objfile)
+  const char *linkage_name, const int linkage_name_len,
+  int copy_name, struct objfile *objfile)
 {
   struct demangled_name_entry **slot;
   /* A 0-terminated copy of the linkage name.  */
@@ -788,10 +855,10 @@ symbol_set_names (struct general_symbol_info *gsymbol,
       else
  {
   char *name = (char *) obstack_alloc (&per_bfd->storage_obstack,
-       len + 1);
+       linkage_name_len + 1);
 
-  memcpy (name, linkage_name, len);
-  name[len] = '\0';
+  memcpy (name, linkage_name, linkage_name_len);
+  name[linkage_name_len] = '\0';
   gsymbol->name = name;
  }
       symbol_set_demangled_name (gsymbol, NULL, &per_bfd->storage_obstack);
@@ -802,19 +869,20 @@ symbol_set_names (struct general_symbol_info *gsymbol,
   if (per_bfd->demangled_names_hash == NULL)
     create_demangled_names_hash (objfile);
 
-  if (linkage_name[len] != '\0')
+  if (linkage_name[linkage_name_len] != '\0')
     {
       char *alloc_name;
 
-      alloc_name = (char *) alloca (len + 1);
-      memcpy (alloc_name, linkage_name, len);
-      alloc_name[len] = '\0';
+      alloc_name = (char *) alloca (linkage_name_len + 1);
+      memcpy (alloc_name, linkage_name, linkage_name_len);
+      alloc_name[linkage_name_len] = '\0';
 
       linkage_name_copy = alloc_name;
     }
   else
     linkage_name_copy = linkage_name;
 
+  entry.mangled_len = linkage_name_len;
   entry.mangled = linkage_name_copy;
   slot = ((struct demangled_name_entry **)
   htab_find_slot (per_bfd->demangled_names_hash,
@@ -847,6 +915,7 @@ symbol_set_names (struct general_symbol_info *gsymbol,
        obstack_alloc (&per_bfd->storage_obstack,
       offsetof (struct demangled_name_entry, demangled)
       + demangled_len + 1));
+  (*slot)->mangled_len = linkage_name_len;
   (*slot)->mangled = linkage_name;
  }
       else
@@ -860,10 +929,11 @@ symbol_set_names (struct general_symbol_info *gsymbol,
     = ((struct demangled_name_entry *)
        obstack_alloc (&per_bfd->storage_obstack,
       offsetof (struct demangled_name_entry, demangled)
-      + len + demangled_len + 2));
+      + linkage_name_len + demangled_len + 2));
   mangled_ptr = &((*slot)->demangled[demangled_len + 1]);
   strcpy (mangled_ptr, linkage_name_copy);
   (*slot)->mangled = mangled_ptr;
+  (*slot)->mangled_len = linkage_name_len;
  }
 
       if (demangled_name != NULL)
--
2.5.5


Reply | Threaded
Open this post in threaded view
|

Re: Note on choosing string hash functions

Dmitry Antipov
On 11/28/2017 03:00 PM, Pedro Alves wrote:

> Anyway, this is just a brain dump from a little investigation I did this past
> weekend, and I think that this area has a lot of scope for improvements, but
> I won't be able to follow up on any of this myself in the next following
> weeks at least, with several gdb 8.1 issues on my platе.

Just for the record, I have a brain dump around this area too :-). Instead of
optimizing htab_hash_string itself, we can try to reduce number of calls (and
drop a few calls to strcmp as well if lucky):

---
  gdb/symtab.c | 10 ++++++++--
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/gdb/symtab.c b/gdb/symtab.c
index ebb7fbe1e0..c9dab4d575 100644
--- a/gdb/symtab.c
+++ b/gdb/symtab.c
@@ -673,6 +673,7 @@ symbol_set_language (struct general_symbol_info *gsymbol,
  struct demangled_name_entry
  {
    const char *mangled;
+  hashval_t hashval;
    char demangled[1];
  };

@@ -684,7 +685,7 @@ hash_demangled_name_entry (const void *data)
    const struct demangled_name_entry *e
      = (const struct demangled_name_entry *) data;

-  return htab_hash_string (e->mangled);
+  return e->hashval;
  }

  /* Equality function for the demangled name hash.  */
@@ -697,7 +698,8 @@ eq_demangled_name_entry (const void *a, const void *b)
    const struct demangled_name_entry *db
      = (const struct demangled_name_entry *) b;

-  return strcmp (da->mangled, db->mangled) == 0;
+  return (da->hashval != db->hashval) ? 0
+    : (strcmp (da->mangled, db->mangled) == 0);
  }

  /* Create the hash table used for demangled names.  Each hash entry is
@@ -816,6 +818,8 @@ symbol_set_names (struct general_symbol_info *gsymbol,
      linkage_name_copy = linkage_name;

    entry.mangled = linkage_name_copy;
+  entry.hashval = htab_hash_string (linkage_name_copy);
+
    slot = ((struct demangled_name_entry **)
           htab_find_slot (per_bfd->demangled_names_hash,
                           &entry, INSERT));
@@ -848,6 +852,7 @@ symbol_set_names (struct general_symbol_info *gsymbol,
                               offsetof (struct demangled_name_entry, demangled)
                               + demangled_len + 1));
           (*slot)->mangled = linkage_name;
+         (*slot)->hashval = entry.hashval;
         }
        else
         {
@@ -864,6 +869,7 @@ symbol_set_names (struct general_symbol_info *gsymbol,
           mangled_ptr = &((*slot)->demangled[demangled_len + 1]);
           strcpy (mangled_ptr, linkage_name_copy);
           (*slot)->mangled = mangled_ptr;
+         (*slot)->hashval = entry.hashval;
         }

        if (demangled_name != NULL)
--

This immediately moves htab_hash_string down in reports, from:

      9.67%  gdb      gdb                       [.] find_pc_sect_psymtab
      6.62%  gdb      gdb                       [.] bcache_full
      5.41%  gdb      libc-2.25.so              [.] tolower
      4.44%  gdb      gdb                       [.] htab_hash_string                    ; 4th place
      4.14%  gdb      gdb                       [.] htab_find_slot_with_hash
      3.86%  gdb      gdb                       [.] load_partial_dies
      3.56%  gdb      gdb                       [.] strcmp_iw_ordered
      3.48%  gdb      gdb                       [.] read_indirect_string_at_offset_from
      3.26%  gdb      gdb                       [.] lookup_minimal_symbol_by_pc_name
      3.12%  gdb      gdb                       [.] cpname_parse
      2.77%  gdb      gdb                       [.] read_attribute_value
      2.72%  gdb      gdb                       [.] cpname_lex
      2.71%  gdb      libc-2.25.so              [.] __strcmp_sse2_unaligned
      2.49%  gdb      gdb                       [.] peek_die_abbrev
      2.30%  gdb      libc-2.25.so              [.] _int_malloc
      2.16%  gdb      gdb                       [.] d_print_comp_inner

to:

      9.78%  gdb      gdb                       [.] find_pc_sect_psymtab
      6.63%  gdb      gdb                       [.] bcache_full
      5.00%  gdb      libc-2.25.so              [.] tolower
      4.20%  gdb      gdb                       [.] htab_find_slot_with_hash
      3.79%  gdb      gdb                       [.] load_partial_dies
      3.68%  gdb      gdb                       [.] lookup_minimal_symbol_by_pc_name
      3.54%  gdb      gdb                       [.] read_indirect_string_at_offset_from
      3.47%  gdb      gdb                       [.] strcmp_iw_ordered
      3.39%  gdb      gdb                       [.] cpname_parse
      2.71%  gdb      gdb                       [.] read_attribute_value
      2.53%  gdb      gdb                       [.] cpname_lex
      2.46%  gdb      gdb                       [.] peek_die_abbrev
      2.43%  gdb      gdb                       [.] d_print_comp_inner
      2.43%  gdb      libc-2.25.so              [.] _int_malloc
      2.38%  gdb      gdb                       [.] htab_hash_string                    ; 15th place

at the cost of having per_bfd->storage_obstack ~3.7% larger for libxul.so with ~2.3M symbols.

Dmitry

Reply | Threaded
Open this post in threaded view
|

Re: Note on choosing string hash functions

Pedro Alves-7
On 11/28/2017 03:13 PM, Dmitry Antipov wrote:

> On 11/28/2017 03:00 PM, Pedro Alves wrote:
>
>> Anyway, this is just a brain dump from a little investigation I did
>> this past
>> weekend, and I think that this area has a lot of scope for
>> improvements, but
>> I won't be able to follow up on any of this myself in the next following
>> weeks at least, with several gdb 8.1 issues on my platе.
>
> Just for the record, I have a brain dump around this area too :-).
> Instead of
> optimizing htab_hash_string itself, we can try to reduce number of calls
> (and
> drop a few calls to strcmp as well if lucky):

Oh, that does sound like a good idea indeed.

Thanks,
Pedro Alves