jemalloc - General purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support. [BSD]

jemalloc is a general purpose malloc(3) implementation that emphasizes
fragmentation avoidance and scalable concurrency support.  jemalloc first came
into use as the FreeBSD libc allocator in 2005, and since then it has found its
way into numerous applications that rely on its predictable behavior.  In 2010
jemalloc development efforts broadened to include developer support features
such as heap profiling and extensive monitoring/tuning hooks.  Modern jemalloc
releases continue to be integrated back into FreeBSD, and therefore versatility
remains critical.  Ongoing development efforts trend toward making jemalloc
among the best allocators for a broad range of demanding applications, and
eliminating/mitigating weaknesses that have practical repercussions for real
world applications.

The COPYING file contains copyright and licensing information.

The INSTALL file contains information on how to configure, build, and install
jemalloc.

The ChangeLog file contains a brief summary of changes for each release.

URL: http://jemalloc.net/
Comments
  • OpenJDK JVM deadlock triggered with jemalloc 5.x?

    We've found an issue that we can only reproduce when LD_PRELOADing jemalloc 5.1. The issue causes a deadlock where one or more threads are waiting to lock an object monitor that no thread is currently holding. When we attempted to debug, we discovered the thread that is expected to be holding the lock has instead left the synchronized block and has returned to the thread pool. If we don't LD_PRELOAD jemalloc and rely on glibc malloc we're unable to reproduce the issue. If we use jemalloc 4.5.0 we're unable to reproduce it there as well.

    We've written a simple test application that can reproduce the issue.

    • Test application: https://github.com/djscholl/jemalloc-java-deadlock
    • Example log output when a deadlock occurred: https://gist.github.com/djscholl/413071ef29671fb53b5a64e105421f1a

    In the example log output, we can see that 999 threads are

            -  blocked on [email protected]
    

    Which is supposedly owned by "pool-1-thread-558", but that thread doesn't appear to be holding the monitor.

    "pool-1-thread-558" Id=566 RUNNABLE
            at java.lang.Class.getConstructor(Class.java:1825)
            at java.security.Provider$Service.newInstance(Provider.java:1594)
            at sun.security.jca.GetInstance.getInstance(GetInstance.java:236)
            at sun.security.jca.GetInstance.getInstance(GetInstance.java:164)
            at java.security.Security.getImpl(Security.java:695)
            at java.security.MessageDigest.getInstance(MessageDigest.java:167)
            at com.example.DeadlockTest.lambda$null$1(DeadlockTest.java:44)
            at com.example.DeadlockTest$$Lambda$7/2093176254.run(Unknown Source)
            ...
    
            Number of locked synchronizers = 1
            - [email protected]
    

    We've reproduced this using a jemalloc.so built with everything set to the default and no malloc options set. We've tried it with --enable-debug and can reproduce it with that as well. We've tried with background threads enabled and disabled, as well as changing number of arenas. All of these configurations still result in reproducing this issue.

    We've tried this on a few different OpenJDK versions as well.

    • Multiple versions of 1.8 between 1.8.0.121 and 1.8.0.202
    • 1.9
    • 1.11

    It's a little awkward to post a JVM deadlock to jemalloc developers, but we're stuck! We'll post this same issue on OpenJDK and link it here when we have it.

    We tried running the test application with -DwaitOnDeadLock=true and attached gdb, and all we see is that the threads are in pthread_cond_wait while the thread ID that they're waiting for is elsewhere outside of the synchronized block (so it shouldn't be holding the object monitor any more).

  • jemalloc leads to segmentation fault in execution that is clean in valgrind with the system allocator

    This has been a recurring issue for us in the development of some of our software (salmon and pufferfish). Essentially, when we link with jemalloc we intermittently get segfaults. Often, re-building the application will "resolve" the issue. Of course, this is almost always a sign of an underlying memory issue in the client program, and so I had previously assumed the same was the case here (despite the fact that these crashes don't happen with the system allocator).

    However, we've gone to considerable lengths to rule out (though we can never completely eliminate) the possibility of memory errors in the client code. This recent issue in the pufferfish repository demonstrates one such example, where the program repeatedly segfaults when linked with jemalloc, yet runs to completion when linked with the system allocator. Further, evaluating the execution (under the system allocator) with valgrind shows 0 memory errors (similar results are obtained with the address and memory sanitizers). These random crashes have been plaguing certain builds of our tools for a couple of years now, so it would be great to figure out what is going on, and either pin it down to some memory error in our code that affects jemalloc (but isn't detectable with valgrind) or find the underlying issue in jemalloc.

  • Make last-N dumping non-blocking

    A few remarks:

    • I'm chopping last-N dumping into batches, each under the prof_recent_alloc_mtx, so that sampled malloc and free can proceed between the batches rather than being blocked until the entire dumping process finishes (see the sketch after this list).
    • I'm using the existing prof_dump_mtx to cover the entire dumping process, during which I first change the limit to unlimited (so that existing records can stay), then perform the dumping batches, and finally revert the limit back (and shorten the record list). prof_dump_mtx serves to only permit one thread at a time to dump, either the last-N records or the original stacktrace-based profiling information. An alternative approach is to use a separate mutex for the last-N records, so that the two types of dumping can take place concurrently. Thoughts?
    • I'm additionally changing the mallctl logic for reading and writing the limit: they now need the prof_dump_mtx. For reading, this ensures that what's being read is always the real limit. For writing, this ensures that the application cannot change the limit during dumping. The downside is that the mallctl calls are blocked until the entire dumping process finishes, but I think it's fine, because the mallctl calls are very rare and only initiated by the application.
    • I'm increasing the buffer size to be the same as the size used by stats printing and the original profiling dumping. I think I could even consolidate the last-N buffer with the original profiling buffer, especially since I'm already using prof_dump_mtx. Thoughts? I could have a separate commit for that, since that'd also need some refactoring of the original profiling dumping logic.
    • The batch size is chosen to be 64. Making such a choice is quite tricky; here's how I arrived at it -
      • The goal I'm pursuing is to find a batch size so that each batch can trigger at most one I/O procedural call: the worst case blocking time is always at least one I/O, so a smaller batch size cannot reduce the worst case blocking time, while a larger batch size can multiply the worst case blocking time.
      • The amount of output per record depends primarily on (a) the length of the stack trace and (b) whether the record has been released (two stack traces if released; one if not). I examined last-N dumps from production, 4 per service, and found that one of the services happened to have both the longest stack traces and the highest proportion of released records; the average length per record for that service is on the order of 800-900 characters (in compact JSON format). So, if I set the batch size to 64 records, each batch will output at most slightly under 64K characters, which is the size of the buffer.
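
    A rough illustration of the batching pattern described above (an illustration only, not the actual jemalloc code; record_t, emit_record, and the mutex name here are made up). The list mutex is dropped between batches so that sampled malloc/free can interleave:

        #include <pthread.h>
        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>

        #define DUMP_BATCH_SIZE 64  /* the batch size discussed above */

        typedef struct record_s record_t;
        struct record_s {
            record_t *next;
            int payload;            /* stand-in for the real record contents */
        };

        static pthread_mutex_t record_mtx = PTHREAD_MUTEX_INITIALIZER;
        static record_t *record_head;   /* protected by record_mtx */

        static void
        emit_record(const record_t *r) {
            /* Stand-in for appending the record to the buffered dump output. */
            printf("%d\n", r->payload);
        }

        void
        dump_in_batches(void) {
            record_t *cursor = NULL;
            bool started = false;
            for (;;) {
                pthread_mutex_lock(&record_mtx);
                if (!started) {
                    cursor = record_head;
                    started = true;
                }
                size_t n = 0;
                while (cursor != NULL && n < DUMP_BATCH_SIZE) {
                    emit_record(cursor);
                    cursor = cursor->next;
                    n++;
                }
                pthread_mutex_unlock(&record_mtx);
                if (cursor == NULL) {
                    break;  /* reached the end of the record list */
                }
                /* Between iterations, sampled malloc/free can take record_mtx.
                 * Holding cursor across the unlock is safe only because the
                 * record limit is raised during the dump, so existing records
                 * are not removed (as described above). */
            }
        }
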
  • static TLS errors from jemalloc 5.0.0 built on CentOS 6

    I help maintain packages on conda-forge, which has become fairly popular in the Python community. We recently added jemalloc 5.0.0 to the package manager, built on CentOS 6 with devtoolset-2 from this base Docker image (glibc 2.12, I think):

    https://github.com/conda-forge/docker-images/blob/master/linux-anvil/Dockerfile

    On some platforms, like Ubuntu 14.04 (glibc 2.19), using dlopen on the produced shared library leads to errors like

    libjemalloc.so: cannot allocate memory in static TLS block
    

    What is the recommended workaround given that we need to compile on a glibc 2.12 system and deploy the binaries on systems with newer glibc?

    this may be related to https://sourceware.org/bugzilla/show_bug.cgi?id=14898

    cc @xhochy
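
    For what it's worth, a minimal illustration of why the error appears (my own sketch, not the jemalloc source): thread-local variables compiled with the initial-exec TLS model are carved out of the fixed-size static TLS block, which can be exhausted when the defining library is dlopen()ed instead of being linked or LD_PRELOADed at process startup.

        /* Illustration only: a variable declared with the initial-exec model
         * needs space in the static TLS block; if the defining .so is
         * dlopen()ed after startup and no spare static TLS space is left, the
         * dynamic loader fails with "cannot allocate memory in static TLS
         * block". */
        static __thread int fast_counter __attribute__((tls_model("initial-exec")));

        /* The default (global-dynamic) model goes through __tls_get_addr()
         * instead and needs no static TLS space, at the cost of slower
         * accesses. */
        static __thread int portable_counter;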

  • Clean compilation with -Wextra

    Before this pull-request, jemalloc produced many warnings when compiled with -Wextra, with both Clang and GCC. This pull-request fixes the issues raised by these warnings, or suppresses them if they were spurious, at least for the Clang and GCC versions covered by CI.

    This pull-request:

    • adds JEMALLOC_DIAGNOSTIC macros: JEMALLOC_DIAGNOSTIC_{PUSH,POP} are used to modify the stack of enabled diagnostics. The JEMALLOC_DIAGNOSTIC_IGNORE_... macros are used to ignore a specific diagnostic (see the sketch below).

    • adds JEMALLOC_FALLTHROUGH macro to explicitly state that falling through case labels in a switch statement is intended

    • locally suppresses many unused-argument warnings by adding missing UNUSED annotations

    • locally suppresses some -Wextra diagnostics:

      • -Wmissing-field-initializers is buggy in older Clang and GCC versions, which do not understand that, in C, = {0} is a common idiom for zero-initializing a struct

      • -Wtype-limits is suppressed in a particular situation where a generic macro, used in multiple different places, compares an unsigned integer against zero, a comparison whose result is known at compile time.

      • -Walloc-size-larger-than= diagnostics warn when an allocation function is called with a size that is too large (out of range). These are suppressed in the parts of the tests where jemalloc deliberately does this to check that the allocation functions fail properly.

    • fixes a bug in the log.c tests where an array was written out of bounds, which probably invoked undefined behavior.

    Closes #1196.
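
    For reference, a minimal sketch of how diagnostic push/pop/ignore macros of this kind are commonly defined for GCC and Clang (my own illustration; the exact definitions in the pull-request may differ):

        #include <assert.h>

        #if defined(__GNUC__) || defined(__clang__)
        #  define JEMALLOC_DIAGNOSTIC_PUSH      _Pragma("GCC diagnostic push")
        #  define JEMALLOC_DIAGNOSTIC_POP       _Pragma("GCC diagnostic pop")
        #  define JEMALLOC_DIAGNOSTIC_IGNORE(W) _Pragma(W)
        #else
        #  define JEMALLOC_DIAGNOSTIC_PUSH
        #  define JEMALLOC_DIAGNOSTIC_POP
        #  define JEMALLOC_DIAGNOSTIC_IGNORE(W)
        #endif

        /* Example use: silence a spurious "comparison is always true" warning
         * around a generic macro that may compare an unsigned value with zero. */
        #define ASSERT_IN_RANGE(x, max) do {                                   \
                JEMALLOC_DIAGNOSTIC_PUSH                                       \
                JEMALLOC_DIAGNOSTIC_IGNORE(                                    \
                    "GCC diagnostic ignored \"-Wtype-limits\"")                \
                assert((x) >= 0 && (x) <= (max));                              \
                JEMALLOC_DIAGNOSTIC_POP                                        \
        } while (0)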

  • physical memory goes high every several hours

    Hi There,

    We are using jemalloc 5.0.1 in our project, and have found that physical memory usage climbs sharply every several (>10) hours. Here is the log I captured; more than 10 GB of additional physical memory was used during this period:

    Allocated: 56763404160, active: 64958468096, metadata: 3267106432, resident: 70248562688, mapped: 70786420736, retained: 405324754944
    Allocated: 56876350976, active: 65120444416, metadata: 3292205344, resident: 74587324416, mapped: 75117805568, retained: 405240102912
    Allocated: 56737409856, active: 64979918848, metadata: 3293146528, resident: 75795795968, mapped: 76325535744, retained: 404032372736
    Allocated: 56738962464, active: 64995016704, metadata: 3296629168, resident: 76685611008, mapped: 77218127872, retained: 403615834112
    Allocated: 56968671360, active: 65284304896, metadata: 3296170416, resident: 78292492288, mapped: 78825009152, retained: 402008952832
    Allocated: 56968786248, active: 65279537152, metadata: 3298034096, resident: 79658573824, mapped: 80191090688, retained: 400642871296
    Allocated: 56941156840, active: 65251299328, metadata: 3297322160, resident: 80860139520, mapped: 81392623616, retained: 399441338368
    Allocated: 56991072392, active: 65310920704, metadata: 3312494544, resident: 82332794880, mapped: 82864013312, retained: 399729459200
    Allocated: 57126460528, active: 65457401856, metadata: 3318715504, resident: 83553558528, mapped: 84290650112, retained: 399185723392
    Allocated: 56571929400, active: 64856027136, metadata: 3341452928, resident: 85106311168, mapped: 85832876032, retained: 400474652672
    Allocated: 56948892104, active: 65236578304, metadata: 3443298560, resident: 84992585728, mapped: 85696909312, retained: 413038342144

    Except for resident/mapped, which vary a lot, the other values remain almost the same. What is the reason for this high physical memory usage? Does jemalloc reclaim unused physical memory instantly or periodically? By the way, HugeTLB is disabled on this machine.
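
    Not an answer to the root cause, but for what it's worth: in jemalloc 5.x unused dirty pages are returned gradually according to dirty_decay_ms/muzzy_decay_ms rather than instantly, and a purge can be forced at any time through the documented mallctl interface, roughly like this (MALLCTL_ARENAS_ALL comes from <jemalloc/jemalloc.h>; use the je_ prefix if the build has one):

        #include <stdio.h>
        #include <jemalloc/jemalloc.h>

        /* Ask every arena to purge its unused dirty pages right now. */
        void
        purge_all_arenas(void) {
            char cmd[64];
            snprintf(cmd, sizeof(cmd), "arena.%u.purge",
                (unsigned)MALLCTL_ARENAS_ALL);
            if (mallctl(cmd, NULL, NULL, NULL, 0) != 0) {
                fprintf(stderr, "mallctl(\"%s\") failed\n", cmd);
            }
        }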

  • 5.3 Release Candidate

    This issue is for tracking progress and communication related to the upcoming 5.3 release.

    Current release candidate: ed5fc14, which is going through our production testing pipeline.

    A previous commit was deployed to production in November, and no issues were observed.

    CC: @jasone

  • Build a general purpose thread event handler

    I haven't added unit tests yet; I will add them later. I want to get some early feedback on the design first.

    A few remarks:

    • I turned off the accumulation of allocation bytes into thread_allocated when reentrancy_level > 0. Otherwise the implementation would become very tricky: when reentrancy_level > 0, profiling should be turned off by jemalloc design (see https://github.com/jemalloc/jemalloc/blob/785b84e60382515f1bf1a63457da7a7ab5d0a96b/include/jemalloc/internal/prof_inlines_b.h#L135), but if thread_allocated kept increasing, the wait time until the next sampling event would be incorrect, unless we kept some internal state so that when reentrancy_level drops back to 0 we could adjust the wait time; and the store and fetch of that internal state would themselves need a reentrancy_level guard. I decided to put the guard on thread_allocated directly instead: after all, it is now not just an accumulator for allocations but a counter for events, and in general it may not be a good idea for allocations in jemalloc internal calls to trigger events in the same way (which is probably why profiling was designed to be turned off for such internal allocations). And, to be symmetric, I also turned off the incrementing of thread_deallocated when reentrancy_level > 0.

    • The lazy creation of tdata made the event counting tricky: the event counters get fooled by the wrong wait time and need to be restored to the state they would have had if they had only reacted to the right wait time. See my comments in the code for how I resolve this and why.

    • The event handler adopts a lazy approach to wait time reset, so there is no longer any need to occasionally throw the wait time to a huge number manually (e.g. in the case where opt_prof is off). This ends up with cleaner code.

    • I also rearranged the tsd layout. prof_tdata has not been needed on the fast path ever since bytes_until_sample was extracted out of tdata, and now bytes_until_sample is also not needed on the fast path, though we need an additional thread_allocated_threshold_fast there. We end up needing 8 bytes less on the fast path. So I shifted rtree_ctx ahead by 16 bytes (previously we put one extra slow-path field ahead of it, since it needs to be 16-byte aligned) and put all slow-path fields after it. I also changed the layout diagram to reflect that, including adding the span for binshards (which takes a fair amount of space, and I'm not sure whether putting it entirely before the tcache is optimal).

    • My comments on my earlier design #1616 are mostly still relevant -

      • The first two bullets on resolving the double counting issue still apply.

      • The third bullet on overwriting thread_allocated is no longer relevant: the current implementation never overwrites it, so as to stay consistent with jemalloc's promise to the application.

      • The fourth bullet on resolving the overflow issue still applies. One slight but visible implementation detail: previously the test for event triggering was a strict less-than comparison, i.e. bytes_until_sample < 0, but now the equal case also counts; the test is now thread_allocated >= thread_allocated_threshold_fast (or the real threshold in the slow path). The reason is that when thread_allocated_threshold_fast is set to 0, we want thread_allocated >= thread_allocated_threshold_fast to always be true, so that we fall back to the slow path (see the sketch after this list).

      • Regarding the fifth bullet on the need for extra delaying due to the delayed incrementing of thread_allocated: the issue is now resolved in a cleaner and more intuitive way via the thread_allocated_last_event counter (a nice byproduct, since that counter was not designed for this purpose in the first place).

      • Regarding the sixth bullet on the update flag no longer being necessary for prof_sample_check(): the logic is now even simpler - there is no need for the prof_sample_check() function at all, since the thread_event() call adjusts the wait time if and only if an event was triggered, so we only need to check the remaining wait time rather than doing any further comparison.
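
    A rough sketch of the fast-path check described in the fourth bullet above (my own illustration, not the actual implementation): thresholds are absolute values of the thread_allocated counter, the comparison is >= rather than a strict <, and a fast threshold of 0 is trivially exceeded, forcing the slow path.

        #include <stdbool.h>
        #include <stdint.h>

        typedef struct {
            uint64_t thread_allocated;            /* bytes allocated by this thread */
            uint64_t thread_allocated_last_event; /* counter value at the last event */
            uint64_t threshold_fast;              /* 0 means "always take the slow path" */
        } thread_event_state_t;

        static inline bool
        event_trigger_fast(const thread_event_state_t *s) {
            return s->thread_allocated >= s->threshold_fast;
        }

        /* Bytes accumulated since the last event; unsigned subtraction stays
         * correct even if thread_allocated eventually wraps around. */
        static inline uint64_t
        bytes_since_last_event(const thread_event_state_t *s) {
            return s->thread_allocated - s->thread_allocated_last_event;
        }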

  • Long max_wait_ns in jemalloc status.

    Hi All,

    I described this issue on Gitter and, as directed there, am raising the question here.

    The actual problem:

    We are using jemalloc version 5.0.1 in our multi-threaded file parsing application. While working on improving our application throughput, we noticed the stack trace below under gdb:

    #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
    #1  0x00007f0afdb91dbd in __GI___pthread_mutex_lock (mutex=0x7f0ae985b568) at ../nptl/pthread_mutex_lock.c:80
    #2  0x00007f0afe481088 in je_malloc_mutex_lock (mutex=0x7f0ae985b568) at include/jemalloc/internal/mutex.h:77
    #3  je_tcache_bin_flush_small (tbin=0x7f0ab3806088, binind=3, rem=4, tcache=0x7f0ab3806000) at src/tcache.c:105
    #4  0x00007f0afe481bfd in je_tcache_event_hard (tcache=0x7f0ab3806000) at src/tcache.c:39
    #5  0x00007f0afe457104 in je_tcache_event (tcache=0x7f0ab3806000) at include/jemalloc/internal/tcache.h:271
    #6  je_tcache_alloc_large (size=<optimized out>, tcache=<optimized out>, zero=<optimized out>) at include/jemalloc/internal/tcache.h:384
    #7  je_arena_malloc (zero=false, size=<optimized out>, arena=0x0, try_tcache=true) at include/jemalloc/internal/arena.h:969
    #8  je_imalloct (arena=0x0, try_tcache=true, size=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:771
    #9  je_imalloc (size=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:780
    #10 malloc (size=<optimized out>) at src/jemalloc.c:929

    Later we noticed a max_wait_ns value close to 0.3 seconds in the jemalloc stats, which could be a potential source of performance degradation.

    ___ Begin jemalloc statistics ___
    Version: 5.0.1-0-g896ed3a8b3f41998d4fb4d625d30ac63ef2d51fb
    Assertions disabled
    config.malloc_conf: ""
    Run-time option settings:
      opt.abort: false
      opt.abort_conf: false
      opt.retain: true
      opt.dss: "secondary"
      opt.narenas: 16
      opt.percpu_arena: "disabled"
      opt.background_thread: false (background_thread: false)
      opt.dirty_decay_ms: 10000 (arenas.dirty_decay_ms: 10000)
      opt.muzzy_decay_ms: 10000 (arenas.muzzy_decay_ms: 10000)
      opt.junk: "false"
      opt.zero: false
      opt.tcache: true
      opt.lg_tcache_max: 15
      opt.stats_print: true
      opt.stats_print_opts: ""
    Arenas: 16
    Quantum size: 16
    Page size: 4096
    Maximum thread-cached size class: 32768
    Allocated: 9712028528, active: 9955016704, metadata: 66047272, resident: 10031435776, mapped: 10083471360, retained: 1786408960
                               n_lock_ops       n_waiting      n_spin_acq  n_owner_switch   total_wait_ns     max_wait_ns  max_n_thds
    background_thread:               2810               0               0            2123               0               0           0
    ctl:                            76341             315               2            2108     12339917843       215998561           6
    prof:                               0               0               0               0               0               0           0
    Merged arenas stats:
    assigned threads: 24
    uptime: 389717405659
    dss allocation precedence: N/A
    decaying:  time       npages       sweeps     madvises       purged
       dirty:   N/A         2689         4411        18713       591223
       muzzy:   N/A            0            0            0            0
                                allocated      nmalloc      ndalloc    nrequests
    small:                     2713926512    102075050     89041810   8310924619
    large:                     6998102016       222174       166637       222174
    total:                     9712028528    102297224     89208447   8311146793
    active:                    9955016704
    mapped:                   10083471360
    retained:                  1786408960
    base:                        65356320
    internal:                      690952
    tcache:                       3343072
    resident:                 10031435776
                               n_lock_ops       n_waiting      n_spin_acq  n_owner_switch   total_wait_ns     max_wait_ns  max_n_thds
    large:                         161184             417               0           44538     33959773925       271998189           1
    extent_avail:                  407258               7               4           44551       127999148        67999548           1
    extents_dirty:                2683031            2012              61           54134     94579370305       255998296           1
    extents_muzzy:                 252944              19               1           40281               0               0           1
    extents_retained:              480595              33               3           42893      1291991401       139999069           1
    decay_dirty:                   170619               4               6           48972       335997762       119999201           1
    decay_muzzy:                   156830               0               3           48940               0               0           0
    base:                          319374               0               0           40081               0               0           0
    tcache_list:                    44446               0               0           33755               0               0           0
    bins:           size ind    allocated      nmalloc      ndalloc    nrequests      curregs     curslabs regs pgs  util       nfills     nflushes     newslabs      reslabs   n_lock_ops    n_waiting   n_spin_acq  total_wait_ns  max_wait_ns
                       8   0      9174768      4906777      3759931     36601392      1146846         2306  512   1 0.971      1004693       406832         4085       234510      1461810         2239           15   180946795463    467996886
                      16   1      7205120      4817615      4367295     32557685       450320         1822  256   1 0.965      2105395       415422         3099       549408      2569615         1948           24   154846969211    443997045
                      32   2    116546176     14673352     11031284    157879482      3642068        28549  128   1 0.996      3830591       663205        61628      1105934      4633249         3803           19   324037842967    367997551
                      48   3    119091024     21117092     18636029    146136133      2481063         9757  256   3 0.993      4776481       675422        31438       818636      5550373         2310           22   175370832583    387997418
                      64   4    173088448      8406224      5701717   3656045793      2704507        42320   64   1 0.998      2346330       478115        84551      1093944      2996245         1159           20    86831421953    351997657
                      80   5    128189680      6971937      5369566    106557284      1602371         6326  256   5 0.989      2639375       597101        11751       663173      3300550         1751           19   141603057405    379997470
                      96   6      3019200      3390002      3358552     23937634        31450          262  128   3 0.937      2327388       511514          346       687947      2884782          978           25    77851481768    283998109
                     112   7       269696       182457       180049      1175585         2408           15  256   7 0.627       176066        84788         1301         4039       307860          257            2    16923887325    199998668
                     128   8      2589824       261022       240789       447452        20233          663   32   1 0.953       200371        93323         2055        74910       342346          361           15    20267865081    255998296
                     160   9      1150880       102178        94985      3362154         7193           67  128   5 0.838        91210        45200          204        29311       181170          148            1     9171938926    127999148
                     192  10       229632        32430        31234      4559539         1196           35   64   3 0.533        11140         6582           52         2082        66033           96            3     3019979894    327997817
                     224  11     20437536       225910       134671      1162590        91239          729  128   7 0.977        52299        22166         1147        16553       120472           79            2     3583976143    111999255
                     256  12      3653888       130131       115858     13279718        14273          936   16   1 0.953        90379        33400         2894        65328       173060           93            4     4943967090    123999175
                     320  13      3213440       114813       104771     26059425        10042          182   64   5 0.862        95457        30952          240        36418       171252           40            2     2047986367    235998430
                     384  14       905856        22560        20201      1410778         2359           92   32   3 0.801        16780         6798          145         6921        68222           31            1     1295991369    127999148
                     448  15      1025024      1142867      1140579     50540401         2288           48   64   7 0.744       928728       455633           95       319196      1428922         1530            6   126807155844    375997497
                     512  16      1353728     34435026     34432382   3395765781         2644          377    8   1 0.876      3476467      3446590       770446     18050069      8507991         1339           37    81695456124    199998668
                     640  17      1716480         7842         5160      2634044         2682          104   32   5 0.805         2258         1301          156          544        48186           12            2      499996672    131999121
                     768  18      1994496         7740         5143      6693346         2597          183   16   3 0.886         2878         1693          323          629        49453           20            0      575996164    115999227
                     896  19      1291136         7314         5873    270189198         1441           65   32   7 0.692         3130         1769           97          991        49449           20            1      611995925    111999254
                    1024  20     14321664       190116       176130     21647683        13986         3560    4   1 0.982       118507        21516        23534        69129       228861          476           17    13303911434    375997498
                    1280  21     25966080        50427        30141     44317280        20286         1291   16   5 0.982         8026         2866         2609         2081        59238           20            0      447997016     55999627
                    1536  22      7971840        17197        12007        44602         5190          689    8   3 0.941         5572         2605         1404         3909        55584          160            6     3507976648    127999148
                    1792  23      6228992        37802        34326       191055         3476          247   16   7 0.879        32107        30131          385         1154       107241          165            0    12707915409    251998323
                    2048  24     18716672        17880         8741       204504         9139         4590    2   1 0.995         6315         3215         7715         3101        64821           20            0      655995630     59999601
                    2560  25   1888980480       744621         6738      5536396       737883        92250    8   5 0.999       279834         1608        92822          991       419264          398            5    15515896716    131999121
                    3072  26     14505984         9670         4948     20025486         4722         1220    4   3 0.967         2709         1411         2005         1061        51334           22            2      535996433    111999254
                    3584  27      7741440         5297         3137      2398773         2160          306    8   7 0.882         1847          841          487          676        47777           11            0      355997631     91999387
                    4096  28     12161024         8915         5946    270301101         2969         2969    1   1 1             3838         2321         8915            0        65444           26            0      651995660     63999574
                    5120  29     14402560         7441         4628        21996         2813          754    4   5 0.932         3289         1946         1189         1185        51282           65            5     1255991640    127999148
                    6144  30     13910016         4551         2287        13466         2264         1164    2   3 0.972         1351          563         2031          506        49239            9            0      443997042     91999388
                    7168  31     12350464         4218         2495        10929         1723          471    4   7 0.914         1303          545          841          694        47484            8            0      347997685    111999255
                    8192  32     17301504         9633         7521      9182132         2112         2112    1   2 1             6583         5561         9633            0        73718           60            1     2635982456    119999202
                   10240  33     23541760         4836         2537        13072         2299         1184    2   5 0.970         1797          798         2000          699        49865           21            2     1027993156    119999201
                   12288  34     19795968         3673         2062        10838         1611         1611    1   3 1             1412          505         3673            0        52073            6            0       91999387     59999600
                   14336  35     19884032         3484         2097         9892         1387          722    2   7 0.960         1439          608         1484          593        48712           17            0      159998934     35999761
    large:          size ind    allocated      nmalloc      ndalloc    nrequests  curlextents
                   16384  36     30670848       118412       116540       129671         1872
                   20480  37     32071680         3309         1743        10069         1566
                   24576  38     15704064         1845         1206        19239          639
                   28672  39     24457216         1965         1112         7376          853
                   32768...
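
    For reference, the report above comes from opt.stats_print at exit; the same statistics, including the mutex max_wait_ns columns, can also be requested programmatically at any point, e.g.:

        #include <jemalloc/jemalloc.h>

        /* Emit the full statistics report on demand; output goes through
         * malloc_message(), which defaults to stderr. Pass "J" instead of the
         * final NULL for JSON output. */
        void
        dump_stats_now(void) {
            malloc_stats_print(NULL, NULL, NULL);
        }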
    
  • Provide a way to decommit instead of purge

    There are reasons one may want to decommit (which currently does nothing). You can implement commit/decommit with chunk_hooks currently, but that has a non-negligible performance impact. I found a way to achieve a decent equivalent with less performance impact by making purge not purge but actually decommit (and keeping chunk_hooks_t.decommit doing nothing). In fact, it looks like it even improves(!) performance on some of our Firefox benchmarks (which would mean that in some cases we're better off with MEM_DECOMMIT than MEM_RESET).

    But that fails because chunks marked as purged are not committed before they are used. (There is also huge allocation shrinkage, which will try to memset without committing.)

    I kind of worked around the issue by changing the flags set in arena_purge_stashed to use CHUNK_MAP_DECOMMITTED instead of CHUNK_MAP_UNZEROED. That's kind of gross and obviously doesn't solve the problem with memset for huge alloc shrinkage.

    Now the question is how can we properly hook this?

    Kind of relatedly, I'd like for arena_purge_all to be able to go through all previously purged chunks again, and it seems this could be achieved by setting the flags to something other than CHUNK_MAP_UNZEROED too. Seems there's an opportunity to kill two birds with one stone.
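
    For readers less familiar with the Windows side, a small sketch (mine, not jemalloc's actual paging code) of the two strategies being contrasted: MEM_RESET leaves the range committed but lets the OS discard its contents, whereas MEM_DECOMMIT drops the physical backing and requires an explicit re-commit before the pages can be touched again, which is exactly why purged-but-not-recommitted chunks break.

        #include <stddef.h>
        #include <stdbool.h>
        #include <windows.h>

        /* Purge by resetting: pages stay committed and can be reused directly. */
        static bool
        purge_by_reset(void *addr, size_t len) {
            return VirtualAlloc(addr, len, MEM_RESET, PAGE_READWRITE) != NULL;
        }

        /* Purge by decommitting: physical backing is released, but the range
         * must be re-committed before it is written to again. */
        static bool
        purge_by_decommit(void *addr, size_t len) {
            return VirtualFree(addr, len, MEM_DECOMMIT) != 0;
        }

        static bool
        recommit(void *addr, size_t len) {
            return VirtualAlloc(addr, len, MEM_COMMIT, PAGE_READWRITE) != NULL;
        }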

  • Mismatched calls to VirtualAlloc/VirtualFree

    I think at two different points somewhere between 3.6 and the current tip, the way chunks are handled has changed in ways that make the use of VirtualAlloc and VirtualFree on Windows problematic.

    The way they work, you need to match an addr = VirtualAlloc(addr, size, MEM_RESERVE) with a VirtualFree(addr, 0, MEM_RELEASE) (and the second argument has to be 0, not size).

    The problem is that while before we wouldn't end up calling pages_unmap with values of addr and size that don't match previous pages_map, we now do. Essentially, we end up allocating multiple chunks in one go, and deallocating parts of them (leading or trailing) independently.

    So in practice, we're doing things like this, for example:

    addr = pages_map(NULL, 6 * chunksize);
    pages_unmap(addr + 4 * chunksize, 2 * chunksize);
    

    which actually does:

    addr = VirtualAlloc(NULL, 6 * chunksize, MEM_RESERVE);
    VirtualFree(addr + 4 * chunksize, 0, MEM_RELEASE);
    

    and that does nothing, which is wasteful, but not entirely problematic (or aborts with --enable-debug, which is)

    But worse, we're also doing:

    addr = pages_map(NULL, 6 * chunksize);
    pages_unmap(addr, 2 * chunksize);
    

    which actually does

    addr = VirtualAlloc(NULL, 6 * chunksize, MEM_RESERVE);
    VirtualFree(addr, 0, MEM_RELEASE);
    

    and that blows things away, since it releases all 6 chunks when we expect the remaining 4 to still be around. This was definitely made worse by the decrease in chunk size, which makes it happen more often (it seems it was not happening before, but it might actually be the cause of the random crashes we're seeing).

    I've attempted to work around this by making chunksize the canonical size we always VirtualAlloc in the end, but that likely adds a lot of overhead. At least it removes the crashes I'm seeing with the current tip, but it feels like we need something better than this.

    I was thinking maybe of having bigger meta-chunks, making pages_map MEM_COMMIT and pages_unmap MEM_DECOMMIT ranges within them, and having the meta-chunks released when they are entirely decommitted (which, OTOH, requires some extra metadata, and that adds a chicken-and-egg problem: AIUI, even the base allocator uses the chunk allocation code and ends up in pages_map).

    Thoughts?

  • ubsan - left shift of 4095 by 20 places cannot be represented in type 'int'

    As part of debugging #2356, I built jemalloc with GCC's UBSan, and it then reports

    src/jemalloc.c:3201:6: runtime error: left shift of 4095 by 20 places cannot be represented in type 'int'
    

    for quite a number of tests.

    The build was configured with

    EXTRA_CFLAGS=-fsanitize=undefined EXTRA_CXXFLAGS=-fsanitize=undefined LDFLAGS=-fsanitize=undefined ./autogen.sh
    

    gcc is gcc-12.2.1-4.fc36.x86_64
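
    Stripped of its context, the reported operation reduces to something like the following (a generic illustration, not the actual expression at src/jemalloc.c:3201): shifting the int constant 4095 left by 20 bits exceeds INT_MAX, and the usual fix is to widen the operand before shifting.

        #include <stddef.h>

        size_t shift_bad(void)  { return 4095 << 20; }          /* UB: 4095 is an int; 4095 << 20 > INT_MAX */
        size_t shift_good(void) { return (size_t)4095 << 20; }  /* OK: the shift is performed in size_t */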

  • test/stress/cpp/microbench failing with Floating point exception after recent commit

    We can see a test failure on the x86_64 and s390x platforms in our CI after recent commits (481bbfc9906e7744716677edd49d0d6c22556a1a was OK). Both look the same:

    ...
    === test/stress/cpp/microbench ===
    100000000 iterations, malloc_free=1044987us (10.449 ns/iter), new_delete=1059986us (10.599 ns/iter), time consumption ratio=0.986:1
    test_free_vs_delete (non-reentrant): pass
    test/test.sh: line 34: 895548 Floating point exception(core dumped) $JEMALLOC_TEST_PREFIX ${t} /var/lib/jenkins/workspace/jemalloc/label/x86_64/ /var/lib/jenkins/workspace/jemalloc/label/x86_64/
    Test harness error: test/stress/cpp/microbench w/ MALLOC_CONF=""
    Use prefix to debug, e.g. JEMALLOC_TEST_PREFIX="gdb --args" sh test/test.sh test/stress/cpp/microbench
    make: *** [Makefile:707: stress] Error 1
    Build step 'Execute shell' marked build as failure
    

    The environment is Fedora 36 on x86_64 and s390x; there is no such problem on the other platforms (aarch64, ppc64le), it seems.

  • Add loongarch64 LG_QUANTUM size definition.

    1. jemalloc fails to compile on the loongarch architecture. Error message when building jemalloc:

       gcc -std=gnu11 -Wall -Wextra -Wsign-compare -Wundef -Wno-format-zero-length -pipe -g3 -fvisibility=hidden -O3 -funroll-loops -g -O2 -fdebug-prefix-map=/home/loongson/debian-community/jemalloc/sys-jemalloc/debian-pa/jemalloc-5.2.1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -DPIC -c -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -DJEMALLOC_NO_PRIVATE_NAMESPACE -o src/jemalloc.sym.o src/jemalloc.c
       In file included from include/jemalloc/internal/jemalloc_internal_types.h:4,
                        from include/jemalloc/internal/sc.h:4,
                        from include/jemalloc/internal/arena_types.h:4,
                        from include/jemalloc/internal/jemalloc_internal_includes.h:45,
                        from src/jemalloc.c:3:
       include/jemalloc/internal/quantum.h:68:6: error: #error "Unknown minimum alignment for architecture; specify via "
         error "Unknown minimum alignment for architecture; specify via "
       include/jemalloc/internal/quantum.h:69:3: error: expected identifier or ‘(’ before string constant
         "--with-lg-quantum"

    2. Build env:

       $ arch
       loongarch64
       $ gcc --version
       gcc (Loongnix 8.3.0-6.lnd.vec.33) 8.3.0
       Copyright (C) 2018 Free Software Foundation, Inc.

    Please consider this pull request, thanks.
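
    The change itself amounts to adding a definition along the following lines to include/jemalloc/internal/quantum.h (the exact guard macro and value are my assumption; a 2^4 = 16-byte quantum matches other 64-bit architectures):

        /* Assumed sketch; verify the guard macro and value against the actual
         * pull request. */
        #ifdef __loongarch__
        #  define LG_QUANTUM 4
        #endif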

  • Crash only with Jemalloc, ASan is clean

    Hi, I'm facing yet another case of an application that works fine but crashes with Jemalloc. I know there have been several similar issues already, but none helped me fix this one. Usually the problem is elsewhere and not in Jemalloc, but in this case I'm not sure where to look.

    The problem is that Kurento Media Server seems to run fine with the default system malloc, and has no memory issues detected by Address Sanitizer, but it crashes if Jemalloc is preloaded (with LD_PRELOAD). Not at first, but eventually, with enough repeated tests, it ends up crashing where no crash would happen without Jemalloc. To make it even weirder, this only happens when running within a Docker container, and never when I run it on my host machine.

    I've tried with my system's provided Jemalloc (v5.2.1) and with current git dev branch. When running under GDB, it seems that every time the backtrace is different, which leads me to believe it is indeed a memory issue. But one that only manifests with Jemalloc.

    Kurento uses GLib, which has its own slice allocators, so this environment variable is set: G_SLICE=always-malloc.

    An extra complication: Jemalloc's --enable-debug cannot be used. The libsigc++ library contains a bit of memory trickery that is confirmed not to leak, but it is a well-known issue that its slot_base class destructor confuses memory analyzers (see the sigc::mem_fun alloc/dealloc issue):

    I can understand that the error messages from -fsanitize=address cause concern, but I believe that they can be ignored in this case. It's not nice, but we will probably have to accept it in sigc++-2.0.

    So, if I build Jemalloc with --enable-debug (which itself activates --enable-opt-size-checks), then the issue becomes hidden because Jemalloc aborts on the false positive from sigc++.

    Address Sanitizer allows running with new_delete_type_mismatch=0 to disable reports about this particular false positive, but I cannot see anything similar for Jemalloc. Thus, I cannot run with either --enable-debug or --enable-opt-size-checks. (I guess this could become a feature request in itself?)
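
    For illustration, the class of bug the size check is designed to catch looks like this in plain C (my own example; it is not what libsigc++ does, whose report is the false positive described above):

        #include <jemalloc/jemalloc.h>

        void
        sized_dealloc_mismatch(void) {
            void *p = mallocx(48, 0);
            /* Passing a size different from the one requested at allocation
             * time is invalid; with --enable-debug / --enable-opt-size-checks
             * jemalloc aborts here with "size mismatch detected". */
            sdallocx(p, 32, 0);
        }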

    Here are a couple of stack traces that occur when --enable-debug is used, without and with the GDB debugger:

    Stack trace 1:

    • --enable-debug: YES
    • opt-size-checks: YES
    • GDB: NO
    <jemalloc>: size mismatch detected (true size 0 vs input size 0), likely caused by application sized deallocation bugs (source address: 0x7f3f40239530, the current pointer being freed). Suggest building with --enable-debug or address sanitizer for debugging. Abort.
    Aborted (thread 139909551523392, pid 124366)
    Stack trace:
    [__pthread_kill_implementation]
    ./nptl/pthread_kill.c:44
    [__GI_raise]
    sysdeps/posix/raise.c:27
    [__GI_abort]
    ./stdlib/abort.c:81 (discriminator 21)
    [je_safety_check_fail]
    /jemalloc/src/safety_check.c:34
    [je_safety_check_fail_sized_dealloc]
    /jemalloc/src/safety_check.c:16 (discriminator 4)
    [maybe_check_alloc_ctx]
    /jemalloc/src/jemalloc.c:2929
    [isfree]
    /jemalloc/src/jemalloc.c:2986
    [je_sdallocx_default]
    /jemalloc/src/jemalloc.c:3995
    [je_je_sdallocx_noflags]
    /jemalloc/src/jemalloc.c:4019
    [sizedDeleteImpl(void*, unsigned long)]
    /jemalloc/src/jemalloc_cpp.cpp:201
    [operator delete(void*, unsigned long)]
    /jemalloc/src/jemalloc_cpp.cpp:207
    [sigc::internal::signal_impl::notify(void*)]
    /usr/include/c++/9/ext/new_allocator.h:128
    [std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count()]
    /usr/include/c++/9/bits/shared_ptr_base.h:729
    0x55964f6c224c at /kurento/build-Debug/kurento-media-server/server/kurento-media-server
    0x55964f62f909 at /kurento/build-Debug/kurento-media-server/server/kurento-media-server
    [std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()]
    /usr/include/c++/9/bits/shared_ptr_base.h:167
    Aborted (core dumped)
    

    Stack trace 2:

    • --enable-debug: YES
    • opt-size-checks: YES
    • GDB: YES
    <jemalloc>: size mismatch detected (true size 0 vs input size 0), likely caused by application sized deallocation bugs (source address: 0x7fffe6413810, the current pointer being freed). Suggest building with --enable-debug or address sanitizer for debugging. Abort.
    --Type <RET> for more, q to quit, c to continue without paging--c
    
    Thread 15 "kurento-media-s" received signal SIGABRT, Aborted.
    [Switching to Thread 0x7fffebcd3640 (LWP 126681)]
    __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737149482560) at ./nptl/pthread_kill.c:44
    44      ./nptl/pthread_kill.c: No such file or directory.
    (gdb) bt
    #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737149482560) at ./nptl/pthread_kill.c:44
    #1  __pthread_kill_internal (signo=6, threadid=140737149482560) at ./nptl/pthread_kill.c:78
    #2  __GI___pthread_kill (threadid=140737149482560, [email protected]=6) at ./nptl/pthread_kill.c:89
    #3  0x00007ffff6cfc476 in __GI_raise ([email protected]=6) at ../sysdeps/posix/raise.c:26
    #4  0x00007ffff6ce27f3 in __GI_abort () at ./stdlib/abort.c:79
    #5  0x00007ffff7d2aec2 in je_safety_check_fail
        (format=0x7ffff7d6b750 "<jemalloc>: size mismatch detected (true size %zu vs input size %zu), likely caused by application sized deallocation bugs (source address: %p, %s). Suggest building with --enable-debug or address san"...) at src/safety_check.c:32
    #6  0x00007ffff7d2adbb in je_safety_check_fail_sized_dealloc (current_dealloc=true, ptr=0x7fffe6413810, true_size=0, input_size=0)
        at src/safety_check.c:11
    #7  0x00007ffff7c72d23 in maybe_check_alloc_ctx (tsd=0x7fffebcd2be8, ptr=0x7fffe6413810, alloc_ctx=0x7fffebcd0090) at src/jemalloc.c:2925
    #8  0x00007ffff7c72ff4 in isfree (tsd=0x7fffebcd2be8, ptr=0x7fffe6413810, usize=48, tcache=0x7fffebcd2f40, slow_path=true) at src/jemalloc.c:2986
    #9  0x00007ffff7c7695a in je_sdallocx_default (ptr=0x7fffe6413810, size=48, flags=0) at src/jemalloc.c:3993
    #10 0x00007ffff7c76c2a in je_je_sdallocx_noflags (ptr=0x7fffe6413810, size=48) at src/jemalloc.c:4016
    #11 0x00007ffff7d522b4 in sizedDeleteImpl(void*, std::size_t) (ptr=0x7fffe6413810, size=48) at src/jemalloc_cpp.cpp:201
    #12 0x00007ffff7d522e0 in operator delete(void*, unsigned long) (ptr=0x7fffe6413810, size=48) at src/jemalloc_cpp.cpp:206
    #13 0x00007ffff711cb18 in __gnu_cxx::new_allocator<std::_List_node<sigc::slot_base> >::destroy<sigc::slot_base>(sigc::slot_base*)
        (this=0x7fffe64451c8, __p=0x7fffe640bcb0) at /usr/include/c++/9/ext/new_allocator.h:153
    #14 std::allocator_traits<std::allocator<std::_List_node<sigc::slot_base> > >::destroy<sigc::slot_base>(std::allocator<std::_List_node<sigc::slot_base> >&, sigc::slot_base*) (__a=..., __p=0x7fffe640bcb0) at /usr/include/c++/9/bits/alloc_traits.h:497
    #15 std::__cxx11::list<sigc::slot_base, std::allocator<sigc::slot_base> >::_M_erase(std::_List_iterator<sigc::slot_base>)Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
    
        (__position=, this=0x7fffe64451c8) at /usr/include/c++/9/bits/stl_list.h:1921
    #16 std::__cxx11::list<sigc::slot_base, std::allocator<sigc::slot_base> >::erase(std::_List_const_iterator<sigc::slot_base>)Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
    
        (__position=, this=0x7fffe64451c8) at /usr/include/c++/9/bits/list.tcc:158
    #17 sigc::internal::signal_impl::notify(void*) (d=0x7fffe981d0d0) at signal_base.cc:169
    #18 0x00007ffff78ea546 in kurento::EventHandler::~EventHandler() (this=0x7fffe9806c60, __in_chrg=<optimized out>)
        at /kurento/kms-core/src/server/implementation/EventHandler.cpp:44
    

    Notice how the issue happens in the context of sigc::slot_base, which is the false positive I mentioned above.

    Stack trace 3:

    • --enable-debug: YES
    • opt-size-checks: NO (had to edit file jemalloc_preamble.h.in to force set it to false so it does not get enabled)
    • GDB: NO
    <jemalloc>: include/jemalloc/internal/arena_inlines_b.h:462: Failed assertion: "alloc_ctx.szind == edata_szind_get(edata)"
    Aborted (thread 140330790446656, pid 143096)
    Stack trace:
    [__pthread_kill_implementation]
    ./nptl/pthread_kill.c:44
    [__GI_raise]
    sysdeps/posix/raise.c:27
    [__GI_abort]
    ./stdlib/abort.c:81 (discriminator 21)
    [arena_sdalloc]
    /jemalloc/include/jemalloc/internal/arena_inlines_b.h:463
    [isdalloct]
    /jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h:134
    [isfree]
    /jemalloc/src/jemalloc.c:3010
    [je_sdallocx_default]
    /jemalloc/src/jemalloc.c:3995
    [je_je_sdallocx_noflags]
    /jemalloc/src/jemalloc.c:4019
    [sizedDeleteImpl(void*, unsigned long)]
    /jemalloc/src/jemalloc_cpp.cpp:201
    [operator delete(void*, unsigned long)]
    /jemalloc/src/jemalloc_cpp.cpp:207
    [sigc::internal::signal_impl::notify(void*)]
    /usr/include/c++/9/ext/new_allocator.h:128
    [std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count()]
    /usr/include/c++/9/bits/shared_ptr_base.h:729
    0x560978fa324c at /kurento/build-Debug/kurento-media-server/server/kurento-media-server
    0x560978f10909 at /kurento/build-Debug/kurento-media-server/server/kurento-media-server
    [std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()]
    /usr/include/c++/9/bits/shared_ptr_base.h:167
    [std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_data() const]
    /usr/include/c++/9/bits/basic_string.h:191
    Aborted (core dumped)
    

    All these are to show how the false positive hides the actual issue I wanted to investigate... so maybe a mechanism to disable these checks would be helpful in Jemalloc.

    Meanwhile, I just wanted to report this with a detailed explanation, and ask if any idea comes up, as I haven't been able to find the issue that makes it crash when Jemalloc is in use.

    In any case, thanks for the effort of this project!

  • Potential issue with threads that only deallocate

    With the modifications in the latest PR #2349, a thread that only deallocates memory may cause a memory leak, with a VERY SMALL probability, and only when BOTH conditions below are satisfied.

    • The thread only deallocates memory during its lifetime and has NEVER allocated any memory.
    • The total number of deallocations performed by this thread reaches the threshold, i.e. 128, exactly while the thread is being killed, i.e. pthread_exit has been reached but libc is still deallocating memory.
  • Stat idea: peak tracking for resident/dirty/etc.

    In cases where stats were grabbed during a quiescent period, it'd sometimes be handy to know what the world looked like when the heap was at its largest (not exactly captured just by the peaks, but at least it's something).

mimalloc is a compact general purpose allocator with excellent performance.

mimalloc (pronounced "me-malloc") is a general purpose allocator with excellent performance characteristics. Initially developed by Daan Leij

Nov 29, 2022
Custom implementation of C stdlib malloc(), realloc(), and free() functions.

C-Stdlib-Malloc-Implementation NOT INTENDED TO BE COMPILED AND RAN, DRIVER CODE NOT OWNED BY I, ARCINI This is a custom implementation of the standard

Dec 27, 2021
A poggers malloc implementation

pogmalloc(3) A poggers malloc implementation Features Static allocator Real heap allocator (via sbrk(2)) Builtin GC Can also mark static memory Can be

Jun 12, 2022
Test your malloc protection

Test your allocs protections and leaks ! Report Bug · Request Feature Table of Contents About The Tool Getting Started Prerequisites Quickstart Usage

Nov 29, 2022
Malloc Lab: simple memory allocator using sorted segregated free list

LAB 6: Malloc Lab Main Files mm.{c,h} - Your solution malloc package. mdriver.c - The malloc driver that tests your mm.c file short{1,2}-bal.rep - T

Feb 28, 2022
Hardened malloc - Hardened allocator designed for modern systems

Hardened malloc - Hardened allocator designed for modern systems. It has integration into Android's Bionic libc and can be used externally with musl and glibc as a dynamic library for use on other Linux-based platforms. It will gain more portability / integration over time.

Dec 3, 2022
Mimalloc-bench - Suite for benchmarking malloc implementations.

Mimalloc-bench Suite for benchmarking malloc implementations, originally developed for benchmarking mimalloc. Collection of various benchmarks from th

Dec 1, 2022
Implementation of System V shared memory (a type of inter process communication) in xv6 operating system.

NOTE: we have stopped maintaining the x86 version of xv6, and switched our efforts to the RISC-V version (https://github.com/mit-pdos/xv6-riscv.git)

Feb 21, 2022
The Best and Highest-Leveled and Newest bingding for MiMalloc Ever Existed in Rust

The Best and Highest-Leveled and Newest bingding for MiMalloc Ever Existed in Rust mimalloc 1.7.2 stable Why create this in repo https://github.com/pu

Nov 26, 2022
STL compatible C++ memory allocator library using a new RawAllocator concept that is similar to an Allocator but easier to use and write.

memory The C++ STL allocator model has various flaws. For example, they are fixed to a certain type, because they are almost necessarily required to b

Dec 2, 2022
OpenXenium JTAG and Flash Memory programmer

OpenXenium JTAG and Flash Memory programmer * Read: "Home Brew" on ORIGINAL XBOX - a detailed article on why and how * The tools in this repo will all

Oct 23, 2022
A simple windows driver that can read and write to process memory from kernel mode

ReadWriteProcessMemoryDriver A simple windows driver that can read and write to process memory from kernel mode This was just a small project for me t

Nov 9, 2022
MMCTX (Memory Management ConTeXualizer), is a tiny (< 300 lines), single header C99 library that allows for easier memory management by implementing contexts that remember allocations for you and provide freeall()-like functionality.

Oct 2, 2021
Tool for profiling heap usage and memory management

vizzy > ./build/vizzytrace /tmp/heapinfo.trace /bin/find /home/zznop -name vizzy

Jul 22, 2022
STL compatible C++ memory allocator library using a new RawAllocator concept that is similar to an Allocator but easier to use and write.

Dec 2, 2021
A C++ Class and Template Library for Performance Critical Applications

Spirick Tuning A C++ Class and Template Library for Performance Critical Applications Optimized for Performance The Spirick Tuning library provides a

Dec 6, 2021
Test cpu and memory speed at linux-vps

CPU and memory speed test on a linux-vps. It performs meaningless multiplication of arrays of random numbers to determine the speed of the CPU and

Nov 30, 2021
Using shared memory to communicate between two executables or processes, for Windows, Linux and MacOS (posix). Can also be useful for remote visualization/debugging.

shared-memory-example Using shared memory to communicate between two executables or processes, for Windows, Linux and MacOS (posix). Can also be usefu

Aug 17, 2022
A simple C++ library for creating and managing bitstreams in memory.

ezbitstream (v0.001) A simple C++ library for creating and managing bitstreams in memory. API & Implementation ezbitstream implements bitstreams with

Feb 4, 2022