[patch] malloc per-thread cache ready for review

DJ Delorie

tl;dr...

I'd like to commit this when 2.26 opens for new patches.  It adds a
per-thread cache to malloc.

git checkout dj/malloc-tcache
https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/dj/malloc-tcache

git diff origin/master...origin/dj/malloc-tcache

Tunables:

glibc.malloc.tcache_max=<size_t> - largest cached chunk in bytes
glibc.malloc.tcache_count=<size_t> - most cached chunks per bin
glibc.malloc.tcache_unsorted_limit=<size_t> - how far to search the
        unsorted list for additional cacheable blocks.  (default 0 =
        unlimited)
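
For example (assuming a build with tunables enabled; "./app" stands in
for any program), the cache can be tuned per run with the usual
colon-separated GLIBC_TUNABLES syntax, or via the MALLOC_TCACHE_*
environment aliases listed below:

  GLIBC_TUNABLES=glibc.malloc.tcache_max=256:glibc.malloc.tcache_count=4 ./app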

----------------------------------------
Long version...

Most of you are probably aware that I've been working on malloc
performance improvements for a while now.  The first solid improvement
is ready for review, and I'd like to get it reviewed and ready for
commit when 2.26 opens up for new work.

This change adds a small per-thread cache (called "tcache") to the
standard malloc, which allows many of the malloc/free requests to
avoid doing *any* atomics or locks, shortening the "fast path"
significantly.  The tcache has a maximum block size it caches
(tcache_max, default (and maximum) about 504 bytes on 32-bit systems,
1008 on 64-bit systems) and a maximum number of blocks per bucket
(tcache_count, default 7).  When the tcache has a suitable block,
malloc can return it immediately.  Free can store it immediately if
there's space in the tcache.  If malloc can't use the cache, once it
takes the locks on an arena, it claims additional blocks to pre-fill
the cache while it's looking for a returnable block.
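
In outline, the two fast paths look like this (a simplified sketch using
the TCache/TCacheEntry types from the patch below; the helper names, and
the omission of the count and initialization checks, are mine):

  static __thread TCache tcache;

  /* malloc fast path: pop from the per-thread bin; no locks or atomics.  */
  static void *
  tcache_get (size_t tc_idx)
  {
    TCacheEntry *e = tcache.entries[tc_idx];
    if (e == NULL)
      return NULL;              /* miss: fall through to the arena path */
    tcache.entries[tc_idx] = e->next;
    --tcache.counts[tc_idx];
    return (void *) e;
  }

  /* free fast path: push onto the per-thread bin, reusing the chunk's
     user data area as the list link.  */
  static void
  tcache_put (void *p, size_t tc_idx)
  {
    TCacheEntry *e = (TCacheEntry *) p;
    e->next = tcache.entries[tc_idx];
    tcache.entries[tc_idx] = e;
    ++tcache.counts[tc_idx];
  }

(For a concrete index: assuming 16-byte MALLOC_ALIGNMENT, a 100-byte
request maps to tc_idx (100 + 15) / 16 = 7, the bin holding chunks of
up to 112 usable bytes.)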

Tested on x86-64 and x86-32, including replacing the system glibc.

Performance-wise, I've seen this code run on average in 83% of the
time of the baseline code (i.e. ~1.2x faster); see this post for details:
https://sourceware.org/ml/libc-alpha/2017-01/msg00452.html

RSS is about the same with this change, aside from a few outliers in
the worst-case benchmark tests, of course.

The patch re-enables the "experimental malloc" configure option, and
defaults it to "yes".  Disabling it disables the tcache code.

  --disable-experimental-malloc
                          disable experimental malloc features
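
(i.e., a build configured with "./configure --disable-experimental-malloc"
compiles malloc.c without -DUSE_TCACHE; see the malloc/Makefile hunk below.)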

and adds three tunables:

    tcache_max {
      type: SIZE_T
      env_alias: MALLOC_TCACHE_MAX
    }
    tcache_count {
      type: SIZE_T
      env_alias: MALLOC_TCACHE_COUNT
    }
    tcache_unsorted_limit {
      type: SIZE_T
      env_alias: MALLOC_TCACHE_UNSORTED_LIMIT
    }

The tcache is layered on top of the existing fastbins and smallbins.
Since the tcache is small, fastbins still provide a performance
improvement (albeit less of one) and should not be removed at this
time.  Think of them as L1/L2/L3 caches: a miss in the lock-free tcache
falls through to the atomics-only fastbins, and from there to the
lock-protected bins ;-)

Patch attached.

        * config.make.in: Add experimental-malloc option.
        * configure.ac: Likewise.  Default to enabled.
        * configure: Regenerate.
        * elf/dl-tunables.list: Add per-thread cache tunables.
        * malloc/Makefile: Enable per-thread cache if experimental
        malloc is enabled.
        * malloc/arena.c: Add per-thread tunables.
        * malloc/malloc.c: Add per-thread cache.

diff --git a/config.make.in b/config.make.in
index 5836b32..0290d83 100644
--- a/config.make.in
+++ b/config.make.in
@@ -77,6 +77,8 @@ multi-arch = @multi_arch@
 
 mach-interface-list = @mach_interface_list@
 
+experimental-malloc = @experimental_malloc@
+
 nss-crypt = @libc_cv_nss_crypt@
 static-nss-crypt = @libc_cv_static_nss_crypt@
 
diff --git a/configure.ac b/configure.ac
index 4a77411..b929012 100644
--- a/configure.ac
+++ b/configure.ac
@@ -313,6 +313,13 @@ AC_ARG_ENABLE([multi-arch],
       [multi_arch=$enableval],
       [multi_arch=default])
 
+AC_ARG_ENABLE([experimental-malloc],
+      AC_HELP_STRING([--disable-experimental-malloc],
+     [disable experimental malloc features]),
+      [experimental_malloc=$enableval],
+      [experimental_malloc=yes])
+AC_SUBST(experimental_malloc)
+
 AC_ARG_ENABLE([nss-crypt],
       AC_HELP_STRING([--enable-nss-crypt],
      [enable libcrypt to use nss]),
diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
index d8cd912..3e49875 100644
--- a/elf/dl-tunables.list
+++ b/elf/dl-tunables.list
@@ -64,5 +64,17 @@ glibc {
       env_alias: MALLOC_ARENA_TEST
       minval: 1
     }
+    tcache_max {
+      type: SIZE_T
+      env_alias: MALLOC_TCACHE_MAX
+    }
+    tcache_count {
+      type: SIZE_T
+      env_alias: MALLOC_TCACHE_COUNT
+    }
+    tcache_unsorted_limit {
+      type: SIZE_T
+      env_alias: MALLOC_TCACHE_UNSORTED_LIMIT
+    }
   }
 }
diff --git a/malloc/Makefile b/malloc/Makefile
index e93b83b..dd8a43a 100644
--- a/malloc/Makefile
+++ b/malloc/Makefile
@@ -168,6 +168,9 @@ tst-malloc-usable-static-ENV = $(tst-malloc-usable-ENV)
 tst-malloc-usable-tunables-ENV = GLIBC_TUNABLES=glibc.malloc.check=3
 tst-malloc-usable-static-tunables-ENV = $(tst-malloc-usable-tunables-ENV)
 
+ifeq ($(experimental-malloc),yes)
+CPPFLAGS-malloc.c += -DUSE_TCACHE
+endif
 # Uncomment this for test releases.  For public releases it is too expensive.
 #CPPFLAGS-malloc.o += -DMALLOC_DEBUG=1
 
diff --git a/malloc/arena.c b/malloc/arena.c
index b91d7d6..74616df 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -236,6 +236,9 @@ DL_TUNABLE_CALLBACK_FNDECL (set_perturb_byte, int32_t)
 DL_TUNABLE_CALLBACK_FNDECL (set_trim_threshold, size_t)
 DL_TUNABLE_CALLBACK_FNDECL (set_arena_max, size_t)
 DL_TUNABLE_CALLBACK_FNDECL (set_arena_test, size_t)
+DL_TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t)
+DL_TUNABLE_CALLBACK_FNDECL (set_tcache_count, size_t)
+DL_TUNABLE_CALLBACK_FNDECL (set_tcache_unsorted_limit, size_t)
 #else
 /* Initialization routine. */
 #include <string.h>
@@ -322,6 +325,11 @@ ptmalloc_init (void)
   TUNABLE_SET_VAL_WITH_CALLBACK (mmap_max, NULL, set_mmaps_max);
   TUNABLE_SET_VAL_WITH_CALLBACK (arena_max, NULL, set_arena_max);
   TUNABLE_SET_VAL_WITH_CALLBACK (arena_test, NULL, set_arena_test);
+#if USE_TCACHE
+  TUNABLE_SET_VAL_WITH_CALLBACK (tcache_max, NULL, set_tcache_max);
+  TUNABLE_SET_VAL_WITH_CALLBACK (tcache_count, NULL, set_tcache_count);
+  TUNABLE_SET_VAL_WITH_CALLBACK (tcache_unsorted_limit, NULL, set_tcache_unsorted_limit);
+#endif
   __libc_lock_unlock (main_arena.mutex);
 #else
   const char *s = NULL;
@@ -371,7 +379,23 @@ ptmalloc_init (void)
                   if (memcmp (envline, "ARENA_TEST", 10) == 0)
                     __libc_mallopt (M_ARENA_TEST, atoi (&envline[11]));
                 }
+#if USE_TCACHE
+              if (!__builtin_expect (__libc_enable_secure, 0))
+                {
+                  if (memcmp (envline, "TCACHE_MAX", 10) == 0)
+                    __libc_mallopt (M_TCACHE_MAX, atoi (&envline[11]));
+                }
+#endif
               break;
+#if USE_TCACHE
+            case 12:
+              if (!__builtin_expect (__libc_enable_secure, 0))
+                {
+                  if (memcmp (envline, "TCACHE_COUNT", 12) == 0)
+                    __libc_mallopt (M_TCACHE_COUNT, atoi (&envline[13]));
+                }
+              break;
+#endif
             case 15:
               if (!__builtin_expect (__libc_enable_secure, 0))
                 {
@@ -381,6 +405,15 @@ ptmalloc_init (void)
                     __libc_mallopt (M_MMAP_THRESHOLD, atoi (&envline[16]));
                 }
               break;
+#if USE_TCACHE
+            case 21:
+              if (!__builtin_expect (__libc_enable_secure, 0))
+                {
+                  if (memcmp (envline, "TCACHE_UNSORTED_LIMIT", 21) == 0)
+                    __libc_mallopt (M_TCACHE_UNSORTED_LIMIT, atoi (&envline[22]));
+                }
+              break;
+#endif
             default:
               break;
             }
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 4885793..b618be8 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -297,6 +297,33 @@ __malloc_assert (const char *assertion, const char *file, unsigned int line,
 }
 #endif
 
+#ifndef USE_TCACHE
+# define USE_TCACHE 0
+#endif
+#if USE_TCACHE
+/* We want 64 entries.  This is an arbitrary limit, which tunables can reduce.  */
+# define MAX_TCACHE_SIZE (MALLOC_ALIGNMENT * 63)
+# define TCACHE_IDX ((MAX_TCACHE_SIZE / MALLOC_ALIGNMENT) + 1)
+# define size2tidx_(bytes) (((bytes) + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
+
+# define tidx2csize(idx) ((idx)*MALLOC_ALIGNMENT + SIZE_SZ)
+# define tidx2usize(idx) ((idx)*MALLOC_ALIGNMENT)
+
+/* When "x" is a user-provided size.  */
+# define usize2tidx(x) size2tidx_(x)
+/* When "x" is from chunksize().  */
+# define csize2tidx(x) size2tidx_((x)-SIZE_SZ)
+
+/* Rounds up, so...
+   idx 0   bytes 0
+   idx 1   bytes 1..8
+   idx 2   bytes 9..16
+   etc.  */
+
+/* This is another arbitrary limit, which tunables can change.  */
+# define TCACHE_FILL_COUNT 7
+#endif
+
 
 /*
   REALLOC_ZERO_BYTES_FREES should be set if a call to
@@ -1711,6 +1738,17 @@ struct malloc_par
 
   /* First address handed out by MORECORE/sbrk.  */
   char *sbrk_base;
+
+#if USE_TCACHE
+  /* Maximum number of buckets to use.  */
+  size_t tcache_max;
+  size_t tcache_max_bytes;
+  /* Maximum number of chunks in each bucket.  */
+  size_t tcache_count;
+  /* Maximum number of chunks to remove from the unsorted list, which
+     don't match.  */
+  size_t tcache_unsorted_limit;
+#endif
 };
 
 /* There are several instances of this struct ("arenas") in this
@@ -1749,8 +1787,22 @@ static struct malloc_par mp_ =
   .trim_threshold = DEFAULT_TRIM_THRESHOLD,
 #define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
   .arena_test = NARENAS_FROM_NCORES (1)
+#if USE_TCACHE
+  ,
+  .tcache_count = TCACHE_FILL_COUNT,
+  .tcache_max = TCACHE_IDX,
+  .tcache_max_bytes = tidx2usize (TCACHE_IDX),
+  .tcache_unsorted_limit = 0 /* No limit */
+#endif
 };
 
+/* Non-public mallopt parameters.  */
+#if USE_TCACHE
+# define M_TCACHE_COUNT  -9
+# define M_TCACHE_MAX  -10
+# define M_TCACHE_UNSORTED_LIMIT  -11
+#endif
+
 /* Maximum size of memory handled in fastbins.  */
 static INTERNAL_SIZE_T global_max_fast;
 
@@ -2874,6 +2926,38 @@ mremap_chunk (mchunkptr p, size_t new_size)
 
 /*------------------------ Public wrappers. --------------------------------*/
 
+#if USE_TCACHE
+
+typedef struct TCacheEntry {
+  struct TCacheEntry *next;
+} TCacheEntry;
+
+/* There is one of these for each thread, which contains the
+   per-thread cache (hence "TCache").  Keeping overall size low is
+   mildly important.  INITTED makes sure we don't use the cache when
+   we're shutting down the thread.  Note that COUNTS is redundant with
+   ENTRIES (it just caches the list lengths); both are kept for
+   performance reasons.  */
+typedef struct TCache {
+  /* 0 = uninitted, 1 = normal, anything else = shutting down */
+  char initted;
+  char counts[TCACHE_IDX];
+  TCacheEntry *entries[TCACHE_IDX];
+} TCache;
+
+static __thread TCache tcache = {0,0,0,{0},{0}};
+
+static void __attribute__ ((section ("__libc_thread_freeres_fn")))
+tcache_thread_freeres (void)
+{
+  /* We should flush the cache out to the main arena also, but most of
+     the time we're just exiting the process completely.  */
+  if (tcache.initted == 1)
+    tcache.initted = 2;
+}
+text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
+
+#endif
+
 void *
 __libc_malloc (size_t bytes)
 {
@@ -2884,6 +2968,22 @@ __libc_malloc (size_t bytes)
     = atomic_forced_read (__malloc_hook);
   if (__builtin_expect (hook != NULL, 0))
     return (*hook)(bytes, RETURN_ADDRESS (0));
+#if USE_TCACHE
+  /* _int_free also calls request2size; be careful not to pad twice.  */
+  size_t tbytes = request2size (bytes);
+  size_t tc_idx = csize2tidx (tbytes);
+
+  if (tc_idx < mp_.tcache_max
+      && tc_idx < TCACHE_IDX /* to appease gcc */
+      && tcache.entries[tc_idx] != NULL
+      && tcache.initted == 1)
+    {
+      TCacheEntry *e = tcache.entries[tc_idx];
+      tcache.entries[tc_idx] = e->next;
+      --tcache.counts[tc_idx];
+      return (void *) e;
+    }
+#endif
 
   arena_get (ar_ptr, bytes);
 
@@ -3335,6 +3435,10 @@ _int_malloc (mstate av, size_t bytes)
   mchunkptr fwd;                    /* misc temp for linking */
   mchunkptr bck;                    /* misc temp for linking */
 
+#if USE_TCACHE
+  size_t tcache_unsorted_count;    /* count of unsorted chunks processed */
+#endif
+
   const char *errstr = NULL;
 
   /*
@@ -3387,6 +3491,38 @@ _int_malloc (mstate av, size_t bytes)
               return NULL;
             }
           check_remalloced_chunk (av, victim, nb);
+#if USE_TCACHE
+          /* While we're here, if we see other chunks of the same size,
+             stash them in the tcache.  */
+          size_t tc_idx = csize2tidx (nb);
+          if (tc_idx < mp_.tcache_max)
+            {
+              mchunkptr tc_victim;
+              int found = 0;
+
+              /* While bin not empty and tcache not full, copy chunks over.  */
+              while (tcache.counts[tc_idx] < mp_.tcache_count
+                     && (pp = *fb) != NULL)
+                {
+                  do
+                    {
+                      tc_victim = pp;
+                      if (tc_victim == NULL)
+                        break;
+                    }
+                  while ((pp = catomic_compare_and_exchange_val_acq (fb, tc_victim->fd, tc_victim))
+                         != tc_victim);
+                  if (tc_victim != 0)
+                    {
+                      TCacheEntry *e = (TCacheEntry *) chunk2mem (tc_victim);
+                      e->next = tcache.entries[tc_idx];
+                      tcache.entries[tc_idx] = e;
+                      ++tcache.counts[tc_idx];
+                      ++found;
+                    }
+                }
+            }
+#endif
           void *p = chunk2mem (victim);
           alloc_perturb (p, bytes);
           return p;
@@ -3425,6 +3561,37 @@ _int_malloc (mstate av, size_t bytes)
               if (av != &main_arena)
                  set_non_main_arena (victim);
               check_malloced_chunk (av, victim, nb);
+#if USE_TCACHE
+              /* While we're here, if we see other chunks of the same size,
+                 stash them in the tcache.  */
+              size_t tc_idx = csize2tidx (nb);
+              if (tc_idx < mp_.tcache_max)
+                {
+                  mchunkptr tc_victim;
+                  int found = 0;
+
+                  /* While bin not empty and tcache not full, copy chunks over.  */
+                  while (tcache.counts[tc_idx] < mp_.tcache_count
+                         && (tc_victim = last (bin)) != bin)
+                    {
+                      if (tc_victim != 0)
+                        {
+                          bck = tc_victim->bk;
+                          set_inuse_bit_at_offset (tc_victim, nb);
+                          if (av != &main_arena)
+                            set_non_main_arena (tc_victim);
+                          bin->bk = bck;
+                          bck->fd = bin;
+
+                          TCacheEntry *e = (TCacheEntry *) chunk2mem (tc_victim);
+                          e->next = tcache.entries[tc_idx];
+                          tcache.entries[tc_idx] = e;
+                          ++tcache.counts[tc_idx];
+                          ++found;
+                        }
+                    }
+                }
+#endif
               void *p = chunk2mem (victim);
               alloc_perturb (p, bytes);
               return p;
@@ -3463,6 +3630,16 @@ _int_malloc (mstate av, size_t bytes)
      otherwise need to expand memory to service a "small" request.
    */
 
+#if USE_TCACHE
+  INTERNAL_SIZE_T tcache_nb = 0;
+  if (csize2tidx (nb) <= mp_.tcache_max)
+    tcache_nb = nb;
+  size_t tc_idx = csize2tidx (nb);
+  int return_cached = 0;
+
+  tcache_unsorted_count = 0;
+#endif
+
   for (;; )
     {
       int iters = 0;
@@ -3523,10 +3700,29 @@ _int_malloc (mstate av, size_t bytes)
               set_inuse_bit_at_offset (victim, size);
               if (av != &main_arena)
                  set_non_main_arena (victim);
+#if USE_TCACHE
+              /* Fill cache first, return to user only if cache fills.
+                 We may return one of these chunks later.  */
+              if (tcache_nb
+                  && tcache.counts[tc_idx] < mp_.tcache_count)
+                {
+                  TCacheEntry *e = (TCacheEntry *) chunk2mem (victim);
+                  e->next = tcache.entries[tc_idx];
+                  tcache.entries[tc_idx] = e;
+                  ++tcache.counts[tc_idx];
+                  return_cached = 1;
+                  continue;
+                }
+              else
+                {
+#endif
               check_malloced_chunk (av, victim, nb);
               void *p = chunk2mem (victim);
               alloc_perturb (p, bytes);
               return p;
+#if USE_TCACHE
+                }
+#endif
             }
 
           /* place chunk in bin */
@@ -3593,11 +3789,37 @@ _int_malloc (mstate av, size_t bytes)
           fwd->bk = victim;
           bck->fd = victim;
 
+#if USE_TCACHE
+          /* If we've processed as many chunks as we're allowed while
+             filling the cache, return one of the cached ones.  */
+          ++tcache_unsorted_count;
+          if (return_cached
+              && mp_.tcache_unsorted_limit > 0
+              && tcache_unsorted_count > mp_.tcache_unsorted_limit)
+            {
+              TCacheEntry *e = tcache.entries[tc_idx];
+              tcache.entries[tc_idx] = e->next;
+              --tcache.counts[tc_idx];
+              return (void *) e;
+            }
+#endif
+
 #define MAX_ITERS       10000
           if (++iters >= MAX_ITERS)
             break;
         }
 
+#if USE_TCACHE
+      /* If all the small chunks we found ended up cached, return one now.  */
+      if (return_cached)
+        {
+          TCacheEntry *e = tcache.entries[tc_idx];
+          tcache.entries[tc_idx] = e->next;
+          --tcache.counts[tc_idx];
+          return (void *) e;
+        }
+#endif
+
       /*
          If a large request, scan through the chunks of current bin in
          sorted order to find smallest that fits.  Use the skip list for this.
@@ -3883,6 +4105,23 @@ _int_free (mstate av, mchunkptr p, int have_lock)
 
   check_inuse_chunk(av, p);
 
+#if USE_TCACHE
+  {
+    size_t tc_idx = csize2tidx (size);
+
+    if (tc_idx < mp_.tcache_max
+        && tcache.counts[tc_idx] < mp_.tcache_count
+        && tcache.initted == 1)
+      {
+        TCacheEntry *e = (TCacheEntry *) chunk2mem (p);
+        e->next = tcache.entries[tc_idx];
+        tcache.entries[tc_idx] = e;
+        ++tcache.counts[tc_idx];
+        return;
+      }
+  }
+#endif
+
   /*
     If eligible, place chunk on a fastbin so it can be found
     and used quickly in malloc.
@@ -4844,6 +5083,38 @@ do_set_arena_max (size_t value)
   return 1;
 }
 
+#if USE_TCACHE
+static inline int
+__always_inline
+do_set_tcache_max (size_t value)
+{
+  LIBC_PROBE (memory_mallopt_tcache_max_bytes, 2, value, mp_.tcache_max_bytes);
+  if (value >= 0 && value <= MAX_TCACHE_SIZE)
+    {
+      mp_.tcache_max_bytes = value;
+      mp_.tcache_max = usize2tidx (value) + 1;
+    }
+  return 1;
+}
+
+static inline int
+__always_inline
+do_set_tcache_count (size_t value)
+{
+  LIBC_PROBE (memory_mallopt_tcache_count, 2, value, mp_.tcache_count);
+  mp_.tcache_count = value;
+  return 1;
+}
+
+static inline int
+__always_inline
+do_set_tcache_unsorted_limit (size_t value)
+{
+  LIBC_PROBE (memory_mallopt_tcache_unsorted_limit, 2, value, mp_.tcache_unsorted_limit);
+  mp_.tcache_unsorted_limit = value;
+  return 1;
+}
+#endif
 
 int
 __libc_mallopt (int param_number, int value)
@@ -4904,6 +5175,20 @@ __libc_mallopt (int param_number, int value)
       if (value > 0)
         do_set_arena_test (value);
       break;
+#if USE_TCACHE
+    case M_TCACHE_COUNT:
+      if (value >= 0)
+        do_set_tcache_count (value);
+      break;
+    case M_TCACHE_MAX:
+      if (value >= 0)
+        do_set_tcache_max (value);
+      break;
+    case M_TCACHE_UNSORTED_LIMIT:
+      if (value >= 0)
+        do_set_tcache_unsorted_limit (value);
+      break;
+#endif
     }
   __libc_lock_unlock (av->mutex);
   return res;

Re: [patch] malloc per-thread cache ready for review

Szabolcs Nagy
On 26/01/17 19:29, DJ Delorie wrote:
> +typedef struct TCache {
> +  /* 0 = uninitted, 1 = normal, anything else = shutting down */
> +  char initted;
> +  char counts[TCACHE_IDX];
> +  TCacheEntry *entries[TCACHE_IDX];
> +} TCache;
> +
> +static __thread TCache tcache = {0,0,0,{0},{0}};

this initializer seems to have two excess 0

(i'd just remove the initializer if it's static __thread)
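
(With three members, a matching initializer would be

  static __thread TCache tcache = {0, {0}, {0}};

though, as noted, it can simply be dropped: static TLS data is
zero-initialized anyway.)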


Re: [patch] malloc per-thread cache ready for review

DJ Delorie

Szabolcs Nagy <[hidden email]> writes:
> this initializer seems to have two excess 0

Fixed.  Funny, it built both ways... :-(

Thanks!

Re: [patch] malloc per-thread cache ready for review

DJ Delorie
In reply to this post by DJ Delorie

Well, thank you for testing it at least :-)

I'll work on debugging that.  Hopefully it's something stupid I "fixed"
recently, and not a design flaw somewhere...

Re: [patch] malloc per-thread cache ready for review

Carlos O'Donell
On 01/29/2017 01:16 PM, DJ Delorie wrote:
>
> Well, thank you for testing it at least :-)
>
> I'll work on debugging that.  Hopefully it's something stupid I "fixed"
> recently, and not a design flaw somewhere...
 
Whatever email Markus sent you is not on the list?

Markus, would you mind sharing on list?

--
Cheers,
Carlos.

Re: [patch] malloc per-thread cache ready for review

DJ Delorie
In reply to this post by DJ Delorie

Markus Trippelsdorf <[hidden email]> writes:
> Unfortunately it turned out to be quite unusable.

Fixed.  Turned out to be <= where I meant <
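
(Presumably the bounds check added in _int_malloc, "if (csize2tidx (nb)
<= mp_.tcache_max)": with <=, a request in the largest bin yields
tc_idx == mp_.tcache_max, one past the end of the counts[]/entries[]
arrays.)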

Fix tested by building a full glibc rpm set, installing it, and building
the full rpm set again using the just-installed glibc rpms.

Please test again :-)

Thanks!
DJ

Re: [patch] malloc per-thread cache ready for review

Markus Trippelsdorf
On 2017.02.01 at 01:33 -0500, DJ Delorie wrote:

>
> Markus Trippelsdorf <[hidden email]> writes:
> > Unfortunately it turned out to be quite unusable.
>
> Fixed.  Turned out to be <= where I meant <
>
> Fix tested by building a full glibc rpm set, installing it, and building
> the full rpm set again using the just-installed glibc rpms.
>
> Please test again :-)

Thanks. It works fine now. But I see no speedups for a simple compile
time test:

 % time g++ -w -Ofast tramp3d-v4.cpp
g++ -w -Ofast tramp3d-v4.cpp  26.42s user 0.31s system 99% cpu 26.756 total
(this is the same time as without your patch)

Using the Lockless Memory Allocator (http://locklessinc.com/articles/allocator_tricks/)
I get ~4% speedup:

 % time LD_PRELOAD=/usr/lib/libllalloc.so.1.3 g++ -w -Ofast tramp3d-v4.cpp
LD_PRELOAD=/usr/lib/libllalloc.so.1.3 g++ -w -Ofast tramp3d-v4.cpp  25.25s user 0.38s system 99% cpu 25.659 total

--
Markus

Re: [patch] malloc per-thread cache ready for review

DJ Delorie
Markus Trippelsdorf <[hidden email]> writes:
> Thanks. It works fine now.

Excellent :-)

> But I see no speedups for a simple compile time test:

GCC is single threaded and has its own internal memory management system
it layers on top of malloc, and I've not seen it benefit from a
per-thread cache.

I posted benchmark results here so you can see what type of applications
are affected by tcache:
https://sourceware.org/ml/libc-alpha/2017-01/msg00452.html

Re: [patch] malloc per-thread cache ready for review

Jeff Law
On 02/01/2017 09:14 AM, DJ Delorie wrote:

> Markus Trippelsdorf <[hidden email]> writes:
>> Thanks. It works fine now.
>
> Excellent :-)
>
>> But I see no speedups for a simple compile time test:
>
> GCC is single threaded and has its own internal memory management system
> it layers on top of malloc, and I've not seen it benefit from a
> per-thread cache.
Actually most of GCC's key internal structures are layered on top of
mmap'd pages.  It doesn't go through malloc for any of the key data
structures.

jeff

Re: [patch] malloc per-thread cache ready for review

Alexander Monakov
On Wed, 1 Feb 2017, Jeff Law wrote:

> On 02/01/2017 09:14 AM, DJ Delorie wrote:
> > Markus Trippelsdorf <[hidden email]> writes:
> > > Thanks. It works fine now.
> >
> > Excellent :-)
> >
> > > But I see no speedups for a simple compile time test:
> >
> > GCC is single threaded and has its own internal memory management system
> > it layers on top of malloc, and I've not seen it benefit from a
> > per-thread cache.
> Actually most of GCC's key internal structures are layered on top of mmap'd
> pages.  It doesn't go through malloc for any of the key data structures.

As I recall from looking at profiling data, GCC's C++ frontend is nevertheless a
heavy user of libc malloc/free.  Apart from that, GCC uses its custom GC-capable
allocator extensively.

Alexander

Re: [patch] malloc per-thread cache ready for review

Markus Trippelsdorf
On 2017.02.01 at 19:23 +0300, Alexander Monakov wrote:

> On Wed, 1 Feb 2017, Jeff Law wrote:
>
> > On 02/01/2017 09:14 AM, DJ Delorie wrote:
> > > Markus Trippelsdorf <[hidden email]> writes:
> > > > Thanks. It works fine now.
> > >
> > > Excellent :-)
> > >
> > > > But I see no speedups for a simple compile time test:
> > >
> > > GCC is single threaded and has its own internal memory management system
> > > it layers on top of malloc, and I've not seen it benefit from a
> > > per-thread cache.
> > Actually most of GCC's key internal structures are layered on top of mmap'd
> > pages.  It doesn't go through malloc for any of the key data structures.
>
> As I recall from looking at profiling data, GCC's C++ frontend is nevertheless a
> heavy user of libc malloc/free.  Apart from that, GCC uses its custom GC-capable
> allocator extensively.

Exactly. For C++ projects you can easily get a 6-8% build time speedup
by using an alternative allocator like
http://locklessinc.com/downloads/lockless_allocator_src.tgz (the best in
my testing) or jemalloc.

--
Markus

Re: [patch] malloc per-thread cache ready for review

DJ Delorie

Markus Trippelsdorf <[hidden email]> writes:
> http://locklessinc.com/downloads/lockless_allocator_src.tgz (the best in
> my testing) or jemalloc.

Before we go down the "which allocator is best" road... glibc's
allocator is intended to be a general purpose "reasonably good enough"
system allocator.  It's easy to find a specific allocator that beats it
in a specific test, but being a specifically best allocator is not our
goal here - providing an allocator that can be the default on a
Linux-based system is.

Hence, my goal with the per-thread cache is to make it "generally
better" for overall system performance.

I am not trying to make it better than every other allocator in every
case, that's a futile exercise.

Re: [patch] malloc per-thread cache ready for review

Zack Weinberg
On Wed, Feb 1, 2017 at 11:44 AM, DJ Delorie <[hidden email]> wrote:
>
> Before we go down the "which allocator is best" road... glibc's
> allocator is intended to be a general purpose "reasonably good enough"
> system allocator.  It's easy to find a specific allocator that beats it
> in a specific test, but being a specifically best allocator is not our
> goal here - providing an allocator that can be the default on a
> Linux-based system is.

Still, I would hope that if an existing alternative implementation
(such as either of those mentioned by Markus) is found to be
_consistently_ better than what we have, we would consider adopting
it.

(The difficulty is of course defining "consistently".)

zw

Re: [patch] malloc per-thread cache ready for review

Carlos O'Donell
In reply to this post by DJ Delorie
On 02/01/2017 11:44 AM, DJ Delorie wrote:

>
> Markus Trippelsdorf <[hidden email]> writes:
>> http://locklessinc.com/downloads/lockless_allocator_src.tgz (the best in
>> my testing) or jemalloc.
>
> Before we go down the "which allocator is best" road... glibc's
> allocator is intended to be a general purpose "reasonably good enough"
> system allocator.  It's easy to find a specific allocator that beats it
> in a specific test, but being a specifically best allocator is not our
> goal here - providing an allocator that can be the default on a
> Linux-based system is.
>
> Hence, my goal with the per-thread cache is to make it "generally
> better" for overall system performance.
>
> I am not trying to make it better than every other allocator in every
> case, that's a futile exercise.

Agreed.

The word "best" needs to be qualified by a workload.

If you can provide a workload that performs poorly on glibc malloc and
better on jemalloc for some parameter, then it can be compared more
objectively.

The glibc community broadly agrees that the way to move this forward
is through whole system benchmarking, looking at complete APIs,
gathering traces, and seeing what the workloads are doing and how
to improve the behaviour of certain parameters.

See Red Hat's LPC 2016 presentation and DJ's work on dj/malloc:
https://linuxplumbersconf.org/2016/ocw/system/presentations/3921/original/LPC%202016%20-%20linux%20and%20glibc_%20The%204.5TiB%20malloc%20API%20trace.pdf

Please ask for any help if you are interested in tracing and simulating
the malloc API using the trace/simulation in dj/malloc.

I'm excited to move DJ's tcache work forward since it provides glibc
malloc with an important performance feature already present in tcmalloc
and jemalloc.

--
Cheers,
Carlos.

Re: [patch] malloc per-thread cache ready for review

Carlos O'Donell
In reply to this post by Zack Weinberg
On 02/01/2017 12:39 PM, Zack Weinberg wrote:

> On Wed, Feb 1, 2017 at 11:44 AM, DJ Delorie <[hidden email]> wrote:
>>
>> Before we go down the "which allocator is best" road... glibc's
>> allocator is intended to be a general purpose "reasonably good enough"
>> system allocator.  It's easy to find a specific allocator that beats it
>> in a specific test, but being a specifically best allocator is not our
>> goal here - providing an allocator that can be the default on a
>> Linux-based system is.
>
> Still, I would hope that if an existing alternative implementation
> (such as either of those mentioned by Markus) is found to be
> _consistently_ better than what we have, we would consider adopting
> it.
>
> (The difficulty is of course defining "consistently".)

I don't think that's entirely true. There are non-technical issues
of copyright and licensing to consider.

See my other email in this thread to point out that we need workload
captures to generate a corpus of data to evaluate allocators.

I think there is a lot of value in glibc's malloc. The differences
seen by most are page-based vs. heap-based allocator issues, and
how those choices tie directly into application usage patterns, e.g.
producer-consumer threads vs. stack-based state machines.

--
Cheers,
Carlos.

Re: [patch] malloc per-thread cache ready for review

Markus Trippelsdorf
In reply to this post by DJ Delorie
On 2017.02.01 at 11:44 -0500, DJ Delorie wrote:

>
> Markus Trippelsdorf <[hidden email]> writes:
> > http://locklessinc.com/downloads/lockless_allocator_src.tgz (the best in
> > my testing) or jemalloc.
>
> Before we go down the "which allocator is best" road... glibc's
> allocator is intended to be a general purpose "reasonably good enough"
> system allocator.  It's easy to find a specific allocator that beats it
> in a specific test, but being a specifically best allocator is not our
> goal here - providing an allocator that can be the default on a
> Linux-based system is.
>
> Hence, my goal with the per-thread cache is to make it "generally
> better" for overall system performance.
>
> I am not trying to make it better than every other allocator in every
> case, that's a futile exercise.

Well, there wouldn't be a reason for all these alternative allocators if
glibc's were "reasonably good". In fact it is often astonishingly bad.

Examples are all major browsers (using jemalloc or tcmalloc) and Rust
(jemalloc gets linked in to all generated binaries by default).


--
Markus

Re: [patch] malloc per-thread cache ready for review

Carlos O'Donell
In reply to this post by Markus Trippelsdorf
On 02/01/2017 11:54 AM, Markus Trippelsdorf wrote:

> On 2017.02.01 at 11:44 -0500, DJ Delorie wrote:
>>
>> Markus Trippelsdorf <[hidden email]> writes:
>>> http://locklessinc.com/downloads/lockless_allocator_src.tgz (the best in
>>> my testing) or jemalloc.
>>
>> Before we go down the "which allocator is best" road... glibc's
>> allocator is intended to be a general purpose "reasonably good enough"
>> system allocator.  It's easy to find a specific allocator that beats it
>> in a specific test, but being a specifically best allocator is not our
>> goal here - providing an allocator that can be the default on a
>> Linux-based system is.
>>
>> Hence, my goal with the per-thread cache is to make it "generally
>> better" for overall system performance.
>>
>> I am not trying to make it better than every other allocator in every
>> case, that's a futile exercise.
>
> Well, there wouldn't be a reason for all these alternative allocators if
> glibc's were "reasonably good". In fact it is often astonishingly bad.

A priori knowledge of the workload allows you to choose an allocator
whose semantics match your allocation pattern, which improves
performance and memory usage.

There are many more allocators than tcmalloc and jemalloc, and many more
embedded allocators in projects that you don't readily have visibility
into.

For a general purpose allocator the performance of the allocator can
only be measured against a given corpus of workloads.

To this day no serious corpus of workloads has been collected to measure
allocators against. All the academic papers I've seen only test against
a few workloads.

I hope that within glibc we can gather up workloads to test the allocator
and raise the performance and quality. We have started gathering malloc
traces for just this purpose.

Regarding your comments about glibc malloc being astonishingly bad, do
you have a reference to such a workload? I am looking for _real_
workloads not synthetic ones created to show worst case behaviour in
heap-based allocators (dlmalloc, ptmalloc, and glibc's malloc).

> Examples are all major browsers (using jemalloc or tcmalloc) and Rust
> (jemalloc gets linked in to all generated binaries by default).

TLDR; I don't think any of these examples chose jemalloc because glibc's
malloc was bad, but because they wanted to offer a choice, and fix
Windows flaws.

I can only speculate here because few projects provide detailed analysis
backed up by real data as the rationale for switching to an alternate
allocator.

Firstly the browsers were looking for a cross-OS solution to solving
memory fragmentation issues in Windows, something we don't specifically
cater to in glibc, but which the portable jemalloc did solve. This was
in the FF3 era when jemalloc was added and it solved the Windows XP
fragmentation issues.

The major browsers use forks of jemalloc. As far as I can tell the forks
have diverged enough that they are no longer interchangeable with the
originals, e.g. jemalloc vs. mozjemalloc, though they do merge in
jemalloc enhancements.

I have heard that Firefox needed fast fixed-size caches, and jemalloc was
a suitably licensed and flexible framework to build that with (glibc
was not). Similarly jemalloc is bundled as a permissively licensed
allocator upon which to build your own allocator, and it has become the
de facto standard for that.

Also note that Intel has memkind intended for a NUMA-aware API and that
is built around jemalloc. Here the discussion is much more about a modern
allocator API that supports future NUMA systems. The API itself, if well
defined, could be implemented in glibc's malloc, IMO.

Rust has a concept of "custom allocators", so they didn't choose jemalloc
for any particular reason I can see; they simply acknowledge that
'system_malloc' might not meet your application needs and so provide
alternate allocators, including jemalloc. Rust could add any other
allocators, and so can users, via the custom allocators interface.

Again, nobody was saying glibc malloc was terrible in either of these
cases.

I have nothing but great esteem for the jemalloc and tcmalloc developers;
their allocators are brilliant and very good, and I think that perhaps
page-based allocators may be the better solution for a general purpose
allocator. However, I want to gather the requisite workloads to show
that, and make an informed decision for glibc.

--
Cheers,
Carlos.

Re: [patch] malloc per-thread cache ready for review

Markus Trippelsdorf
On 2017.02.01 at 15:50 -0500, Carlos O'Donell wrote:

> On 02/01/2017 11:54 AM, Markus Trippelsdorf wrote:
> > On 2017.02.01 at 11:44 -0500, DJ Delorie wrote:
> >>
> >> Markus Trippelsdorf <[hidden email]> writes:
> >>> http://locklessinc.com/downloads/lockless_allocator_src.tgz (the best in
> >>> my testing) or jemalloc.
> >>
> >> Before we go down the "which allocator is best" road... glibc's
> >> allocator is intended to be a general purpose "reasonably good enough"
> >> system allocator.  It's easy to find a specific allocator that beats it
> >> in a specific test, but being a specifically best allocator is not our
> >> goal here - providing an allocator that can be the default on a
> >> Linux-based system is.
> >>
> >> Hence, my goal with the per-thread cache is to make it "generally
> >> better" for overall system performance.
> >>
> >> I am not trying to make it better than every other allocator in every
> >> case, that's a futile exercise.
> >
> > Well, there wouldn't be a reason for all these alternative allocators if
> > glibc's were "reasonably good". In fact it is often astonishingly bad.
>
> A priori knowledge of the workload allows you to choose an allocator
> whose semantics match your allocation pattern, which improves
> performance and memory usage.
>
> There are many more allocators than tcmalloc and jemalloc, and many more
> embedded allocators in projects that you don't readily have visibility
> into.
>
> For a general purpose allocator the performance of the allocator can
> only be measured against a given corpus of workloads.
>
> To this day no serious corpus of workloads has been collected to measure
> allocators against. All the academic papers I've seen only test against
> a few workloads.
>
> I hope that within glibc we can gather up workloads to test the allocator
> and raise the performance and quality. We have started gathering malloc
> traces for just this purpose.
>
> Regarding your comments about glibc malloc being astonishingly bad, do
> you have a reference to such a workload? I am looking for _real_
> workloads not synthetic ones created to show worst case behaviour in
> heap-based allocators (dlmalloc, ptmalloc, and glibc's malloc).

Using google turns up several examples. For example, at Facebook,
switching to jemalloc doubled their server throughput:
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/
https://www.facebook.com/Engineering/videos/696488619305/

I have mentioned compile times of C++ projects. Here an alternative
allocator typically decreases build times by 6-8% (I have measured this
myself).

> > Examples are all major browsers (using jemalloc or tcmalloc) and Rust
> > (jemalloc gets linked in to all generated binaries by default).
>
> TLDR; I don't think any of these examples chose jemalloc because glibc's
> malloc was bad, but because they wanted to offer a choice, and fix
> Windows flaws.
>
> I can only speculate here because few projects provide detailed analysis
> backed up by real data as the rationale for switching to an alternate
> allocator.
>
> Firstly the browsers were looking for a cross-OS solution to solving
> memory fragmentation issues in Windows, something we don't specifically
> cater to in glibc, but which the portable jemalloc did solve. This was
> in the FF3 era when jemalloc was added and it solved the Windows XP
> fragmentation issues.
>
> The major browsers use forks of jemalloc. As far as I can tell the forks
> have diverged enough that they are no longer interchangeable with the
> originals, e.g. jemalloc vs. mozjemalloc, though they do merge in
> jemalloc enhancements.

Well, Chromium uses tcmalloc by default.
And I don't think that Windows was the main reason for the switch.
More likely the multi-threaded nature of modern browsers requires a
better allocator.

In this case DJ's per-thread cache patch would help, I guess.

--
Markus

Re: [patch] malloc per-thread cache ready for review

DJ Delorie

Markus Trippelsdorf <[hidden email]> writes:
> Using google turns up several examples.

The trace capture and simulation tools are in the dj/malloc branch;
please feel free to ask Facebook to capture their workload so we can
benchmark our malloc changes against them...

(only partly joking here, I'll happily accept a trace from them, and
it's the only way to make sure their interests are represented in our
work)

> I have mentioned compile times of C++ projects. Here an alternative
> allocator typically decreases build times by 6-8% (I have measured this
> myself).

If you can reproduce those measurements with a captured workload, that
would be a useful addition to our corpus.  Yes, I could do it myself,
but that doesn't guarantee I'm testing the specific inputs and
environment you have, which a workload captures for me.

Hmmm... "send me a workload" is going to be my default response to
malloc performance questions from now on, I think...

> More likely the multi threaded nature of modern browsers requires a
> better allocator.

Yes, glibc's malloc design pre-dates the popularity of multi-threaded
apps, and we've been playing catchup.  Our hot path is very long
compared to other allocators, which is why I chose the tcache as a first
attempt at shortening it.

> In this case DJ's per-thread cache patch would help, I guess.

The whole point of capturing workloads is to eliminate the "I guess".
Let's get a workload and find out if it helps or not :-)