Note: this is an updating post.

Introduction

snmalloc is a memory allocator by Microsoft Research that uses a “message passing” scheme. You can find its source code here. It is designed to perform well in highly parallel workloads where memory allocated in one thread is typically deallocated in another thread. A nice feature is that snmalloc is highly customizable and, more to my interest, its security mitigations can be configured. It also provides abstraction layers for different architectures (AAL) and platforms (PAL).

snmalloc is available as an allocator in CheriBSD, which is a fork of FreeBSD with CHERI support. Currently, jemalloc is still the default allocator in FreeBSD, but we can build libc with the LIBC_MALLOC=snmalloc make option to try snmalloc out.

This article documents my exploration of the internals of snmalloc as available in its official repository.

Custom configuration

The default behaviour of the snmalloc global memory allocator is defined by the StandardConfig class in backend/globalconfig.h, but it can be customized by defining the SNMALLOC_PROVIDE_OWN_CONFIG macro and exporting a customized allocator type as snmalloc::Alloc, as in:

// backend/globalconfig.h
  /**
   * Create allocator type for this configuration.
   */
  using Alloc = snmalloc::LocalAllocator<snmalloc::StandardConfig>;
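
As a rough sketch of what the custom path looks like (the header name, include path and CustomConfig alias here are hypothetical; the tests under test/func/ show real examples):

// my_snmalloc_config.h -- hypothetical example of providing your own config.
#define SNMALLOC_PROVIDE_OWN_CONFIG

// Assumed include path for the default configuration and LocalAllocator.
#include "snmalloc/backend/globalconfig.h"

namespace snmalloc
{
  // Here we simply reuse StandardConfig; a real custom configuration would
  // substitute its own Backend, Pagemap or LocalState.
  using CustomConfig = StandardConfig;

  // The rest of snmalloc picks up whatever type is exported as snmalloc::Alloc.
  using Alloc = LocalAllocator<CustomConfig>;
} // namespace snmalloc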

The configuration class defines the implementation of Pagemap and Authmap (only relevant to architectures that support strict provenance, for now just CHERI), the Backend and the LocalState.

We will assume the given standard config in this article.

Frontend vs backend

There are many mentions of the frontend and backend concepts in the (quite detailed) documentation in snmalloc’s repository. In simple terms, a frontend allocator gets raw memory chunks from the backend and assigns allocator-specific metadata to them, such as sizeclass and slab information. The backend interfaces with the kernel through memory-mapping operations and keeps track of memory chunks to service the frontend.

The backend behavior is defined by the Backend class in the configuration. The frontend behavior is defined by the Alloc class exported by the configuration, which is a LocalAllocator. Each local allocator has an associated core allocator core_alloc of type CoreAlloc and a cache local_cache that holds freelists for the small sizeclasses.

// mem/localalloc.h
  template<SNMALLOC_CONCEPT(IsConfig) Config_>
  class LocalAllocator
  {
    // [...]
  private:
    // [...]

    // Free list per small size class.  These are used for
    // allocation on the fast path. This part of the code is inspired by
    // mimalloc.
    // Also contains remote deallocation cache.
    LocalCache local_cache{&Config::unused_remote};

    // Underlying allocator for most non-fast path operations.
    CoreAlloc* core_alloc{nullptr};
    
    // [...]
  }

The API exposed to the user, e.g. malloc, free and realloc, consists of wrappers around the ThreadAlloc class, which is in turn a wrapper around Alloc (see global/global.h and override/override.h). ThreadAlloc holds a thread-local allocator of the configurable type Alloc together with the thread-local state. This is similar to tcaches in Linux glibc malloc: each thread has its own local allocator object with its own small_fast_free_lists in local_cache. While a local allocator is associated with one core allocator, a core allocator can be associated with many local allocators from different threads, in which case they share the same global state (similar to how different threads share the same fastbins, small_bins, unsorted_bins and large_bins in glibc).
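
In other words, the user-facing entry points boil down to something like this sketch (simplified; the real wrappers in override/ cover more of the malloc API, e.g. calloc zeroing, realloc and sized variants):

#include <cstddef>
#include <snmalloc/snmalloc.h>

// Simplified sketch of what the malloc/free overrides reduce to:
// ThreadAlloc::get() returns the calling thread's local allocator (Alloc).
void* sketch_malloc(std::size_t size)
{
  return snmalloc::ThreadAlloc::get().alloc(size);
}

void sketch_free(void* ptr)
{
  snmalloc::ThreadAlloc::get().dealloc(ptr);
}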

Apart from ThreadAlloc, ScopedAllocator is another wrapper around Alloc that doesn’t depend on thread-local storage, so it can be used as a bootstrapping (slow) allocator, as in test/func/thread_alloc_external/thread_alloc_external.cc.
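
A minimal usage sketch (assuming the snmalloc headers are on the include path; the referenced test shows a complete example):

#include <snmalloc/snmalloc.h>

void scoped_example()
{
  // ScopedAllocator owns an Alloc that does not rely on thread-local storage;
  // operator-> exposes the wrapped allocator.
  snmalloc::ScopedAllocator alloc;
  void* p = alloc->alloc(128);
  alloc->dealloc(p);
} // the wrapped allocator is flushed when it goes out of scope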

Frontend allocation

Because the snmalloc code base heavily uses CapPtr<T, B> to annotate pointers to heap memory with bounds information, we will use the same terminology to refer to memory allocations. That is, a given allocation can be a:

  • Alloc: this is owned by the frontend
  • Chunk: the backend manages chunk allocations and returns them to the frontend; this is what I called raw chunks above. The frontend assigns metadata to chunks and converts them into allocs
  • Arena: owned only by the backend; arenas are further refined into chunks

// ds_core/ptrwrap.h
    namespace bounds
    {
      /**
       * Internal access to an entire Arena.  These exist only in the backend.
       */
      using Arena = bound<
        dimension::Spatial::Arena,
        dimension::AddressSpaceControl::Full,
        dimension::Wildness::Tame>;

      /**
       * Internal access to a Chunk of memory.  These flow across the boundary
       * between back- and front-ends, for example.
       */
      using Chunk = bound<
        dimension::Spatial::Chunk,
        dimension::AddressSpaceControl::Full,
        dimension::Wildness::Tame>;

      /**
       * User access to an entire Chunk.  Used as an ephemeral state when
       * returning a large allocation.  See capptr_chunk_is_alloc.
       */
      using ChunkUser =
        Chunk::with_address_space_control<dimension::AddressSpaceControl::User>;

      /**
       * Internal access to just one allocation (usually, within a slab).
       */
      using AllocFull = Chunk::with_spatial<dimension::Spatial::Alloc>;

      /**
       * User access to just one allocation (usually, within a slab).
       */
      using Alloc = AllocFull::with_address_space_control<
        dimension::AddressSpaceControl::User>;

      /**
       * A wild (i.e., putative) CBAllocExport pointer handed back by the
       * client. See capptr_from_client() and capptr_domesticate().
       */
      using AllocWild = Alloc::with_wildness<dimension::Wildness::Wild>;
    } // namespace bounds

I know it gets confusing, especially if you are familiar with other memory allocators that use the same names to refer to different concepts. Shikata ga nai; hopefully I’m clear enough and not confused myself.

Allocators

As I said, a malloc request first reaches the local allocator. The local allocator handles fast path allocations, as its local cache contains freelists that store previously freed allocs (think glibc tcache). If there are no allocs of the requested size, it delegates the allocation request to its core allocator, and that is the slow path. Some key functions exposed by LocalAllocator are:

  • alloc
    • alloc_not_small: for large requests, request a chunk from the backend and insert it into the core allocator’s laden list of inactive slabs.
    • small_alloc: for requests representable by a small sizeclass, first check the local cache. If there are no entries in the small fast freelist, invoke the core allocator’s small_alloc.
  • dealloc: deallocation is delegated to the core allocator

A core allocator is a stateful allocator containing per-sizeclass slab caches (alloc_classes) and other fields such as laden, entropy, backend_state and attached_cache. Slabs are fixed-size allocations that can be split into smaller allocations. The core allocator interfaces with the backend to expose a higher-level object allocation API:

  • small_alloc: check for a slab of the requested small sizeclass and get an alloc from it. Then update the slab state if it becomes inactive (sleeping slabs are moved to laden). BackendSlabMetadata::alloc_free_list fills the small fast freelists in the local cache with the remaining elements of the slab.
    • small_alloc_slow: if there is no slab that serves allocs of the requested size, request a chunk from the backend to construct a new slab.
  • dealloc_local_object
    • dealloc_local_object_fast: the alloc is inserted into the freelist of its associated slab.
    • dealloc_local_object_slow: the slow path is triggered when the associated slab has needed_ == 0. In the case of:
      • A large alloc: the chunk is returned to the backend (dealloc_chunk)
      • A sleeping slab: the addition of a freed object to its freelist hits the threshold that wakes up the slab so it can start servicing requests again
      • A slab left completely unused by this deallocation: this can trigger the deallocation of unused slabs of the same sizeclass if there are enough of them
    /**
     * The number of deallocation required until we hit a slow path. This
     * counts down in two different ways that are handled the same on the
     * fast path.  The first is
     *   - deallocations until the slab has sufficient entries to be considered
     *   useful to allocate from.  This could be as low as 1, or when we have
     *   a requirement for entropy then it could be much higher.
     *   - deallocations until the slab is completely unused.  This is needed
     *   to be detected, so that the statistics can be kept up to date, and
     *   potentially return memory to the a global pool of slabs/chunks.
     */
    uint16_t needed_ = 0;

This is a snippet of the core allocator fields.

// class CoreAllocator
// ...
    /**
     * Define local names for specialised versions of various types that are
     * specialised for the back-end that we are using.
     * @{
     */
    using BackendSlabMetadata = typename Config::Backend::SlabMetadata;
    using PagemapEntry = typename Config::PagemapEntry;
    /// }@

    /**
     * Per size class list of active slabs for this allocator.
     */
    struct SlabMetadataCache
    {
      SeqSet<BackendSlabMetadata> available{};

      uint16_t unused = 0;
      uint16_t length = 0;
    } alloc_classes[NUM_SMALL_SIZECLASSES]{};

    /**
     * The set of all slabs and large allocations
     * from this allocator that are full or almost full.
     */
    SeqSet<BackendSlabMetadata> laden{};
// ...

As snmalloc is a message passing allocator, you can expect other message queue operations: init_message_queue, handle_message_queue, post, flush… which I won’t dig into just now.

Additionally, each local allocator is associated with a remote allocator. It represents a message queue of freed objects so that an object can be allocated by one allocator and deallocated by a different allocator in a message passing fashion.

// mem/remoteallocator.h
  struct alignas(REMOTE_MIN_ALIGN) RemoteAllocator
  {
    /**
     * Global key for all remote lists.
     *
     * Note that we use a single key for all remote free lists and queues.
     * This is so that we do not have to recode next pointers when sending
     * segments, and look up specific keys based on destination.  This is
     * potentially more performant, but could make it easier to guess the key.
     */
    inline static FreeListKey key_global{0xdeadbeef, 0xbeefdead, 0xdeadbeef};

    using alloc_id_t = address_t;

    // Store the message queue on a separate cacheline. It is mutable data that
    // is read by other threads.
    alignas(CACHELINE_SIZE) freelist::AtomicQueuePtr back{nullptr};
    // Store the two ends on different cache lines as access by different
    // threads.
    alignas(CACHELINE_SIZE) freelist::AtomicQueuePtr front{nullptr};
    // Fake first entry
    freelist::Object::T<capptr::bounds::AllocWild> stub{};
    // [...]
  }

MetaEntry vs SlabMetadata

Every allocation, be it owned by the frontend or the backend, has a metaentry stored in the pagemap. However, allocs in the frontend have additional slab-related metadata, which is stored in SlabMetadata objects. Backend chunks don’t have this slab metadata because the slab representation is specific to the frontend.

Metaentries are stored in a pagemap. The default FlatPagemap implementation stores them in a flat array of metaentries, pointed to by body (a lookup sketch follows the snippet below).

// ds/pagemap.h
  /**
   * Simple pagemap that for each GRANULARITY_BITS of the address range
   * stores a T.
   */
  template<size_t GRANULARITY_BITS, typename T, typename PAL, bool has_bounds>
  class FlatPagemap
  {
  public:
    static constexpr size_t SHIFT = GRANULARITY_BITS;
    static constexpr size_t GRANULARITY = bits::one_at_bit(GRANULARITY_BITS);

  private:
    /**
     * Before init is called will contain a single entry
     * that is the default value.  This is needed so that
     * various calls do not have to check for nullptr.
     *   free(nullptr)
     * and
     *   malloc_usable_size(nullptr)
     * do not require an allocation to have ocurred before
     * they are called.
     */
    inline static const T default_value{};

    /**
     * The representation of the page map.
     *
     * Initially a single element to ensure nullptr operations
     * work.
     */
    T* body{const_cast<T*>(&default_value)};

    /**
     * The representation of the pagemap, but nullptr if it has not been
     * initialised.  Used to combine init checking and lookup.
     */
    T* body_opt{nullptr};

    /**
     * If `has_bounds` is set, then these should contain the
     * bounds of the heap that is being managed by this pagemap.
     */
    address_t base{0};
    size_t size{0};
// [...]
  }
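
Conceptually, looking up the entry for an address is just an index computation over that array; here is a toy sketch of the idea (not the real FlatPagemap::get, which also deals with bounds and lazy initialisation):

#include <cstddef>
#include <cstdint>

// Toy sketch: each 2^GRANULARITY_BITS-sized window of the address space
// maps to one entry of the flat array (body in FlatPagemap).
template<std::size_t GRANULARITY_BITS, typename T>
const T& toy_pagemap_get(const T* body, std::uintptr_t addr)
{
  return body[addr >> GRANULARITY_BITS];
}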

A metaentry consists of two pointer-sized fields: meta and remote_and_sizeclass (abbreviated ras). For an alloc, meta points to its SlabMetadata object, but for chunks in the backend it can have other meanings. For an alloc, ras encodes a reference to the owning frontend allocator’s message queue (RemoteAllocator) together with the sizeclass (a sketch of this packing follows the snippet below). Again, this field has a different meaning for backend-owned chunks.

  class MetaEntryBase
  {
  protected:
  // [...]
    /**
     * In common cases, the pointer to the slab metadata.  See
     * docs/AddressSpace.md for additional details.
     *
     * The bottom bit is used to indicate if this is the first chunk in a PAL
     * allocation, that cannot be combined with the preceeding chunk.
     */
    uintptr_t meta{0};

    /**
     * In common cases, a bit-packed pointer to the owning allocator (if any),
     * and the sizeclass of this chunk.  See `encode` for
     * details of this case and docs/AddressSpace.md for further details.
     */
    uintptr_t remote_and_sizeclass{0};
  // [...]
  };
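
This packing works because RemoteAllocator objects are aligned (alignas(REMOTE_MIN_ALIGN), as seen earlier), so the low bits of their address are free to carry the sizeclass. A simplified sketch of the idea; the mask width is illustrative and not the real encoding used by the encode helper mentioned in the comment above:

#include <cstdint>

// Illustrative only: pack an aligned RemoteAllocator address and a small
// sizeclass into a single word by reusing the pointer's low alignment bits.
constexpr std::uintptr_t toy_sizeclass_mask = 0x7f; // hypothetical width

constexpr std::uintptr_t toy_encode(std::uintptr_t remote, std::uintptr_t sizeclass)
{
  return remote | (sizeclass & toy_sizeclass_mask);
}

constexpr std::uintptr_t toy_remote(std::uintptr_t ras)
{
  return ras & ~toy_sizeclass_mask;
}

constexpr std::uintptr_t toy_sizeclass(std::uintptr_t ras)
{
  return ras & toy_sizeclass_mask;
}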

Every alloc has a SlabMetadata object that links it to a slab. A slab maintains a freelist (or two, when the random_preserve mitigation is enabled) of freed allocs.

// [...]
    /**
     *  Data-structure for building the free list for this slab.
     */
    SNMALLOC_NO_UNIQUE_ADDRESS freelist::Builder<mitigations(random_preserve)>
      free_queue;

    /**
     * The number of deallocation required until we hit a slow path. This
     * counts down in two different ways that are handled the same on the
     * fast path.  The first is
     *   - deallocations until the slab has sufficient entries to be considered
     *   useful to allocate from.  This could be as low as 1, or when we have
     *   a requirement for entropy then it could be much higher.
     *   - deallocations until the slab is completely unused.  This is needed
     *   to be detected, so that the statistics can be kept up to date, and
     *   potentially return memory to the a global pool of slabs/chunks.
     */
    uint16_t needed_ = 0;

    /**
     * Flag that is used to indicate that the slab is currently not active.
     * I.e. it is not in a CoreAllocator cache for the appropriate sizeclass.
     */
    bool sleeping_ = false;

    /**
     * Flag to indicate this is actually a large allocation rather than a slab
     * of small allocations.
     */
    bool large_ = false;
// [...]

Backend allocation

The backend manages memory allocations from the platform. This layer deals with Chunk-bounded pointers, capptr::Chunk<void>. When the frontend requests a chunk (for example, to create a new slab), the backend actually performs two allocations: one for the slab metadata and one for the user data chunk. Some key functions are:

  • alloc_meta_data
  • dealloc_meta_data
  • alloc_chunk
  • dealloc_chunk
  /**
   * This class implements the standard backend for handling allocations.
   * It is parameterised by its Pagemap management and
   * address space management (LocalState).
   */
  template<
    SNMALLOC_CONCEPT(IsPAL) PAL,
    typename PagemapEntry,
    typename Pagemap,
    typename Authmap,
    typename LocalState>
  class BackendAllocator

The metadata and the user data chunks are allocated differently. Metadata chunks are serviced by local_state->get_meta_range(), while user data chunks are serviced by local_state->get_object_range(). What are these ranges? Ranges are similar to passes in compilers: a range takes a piece of address space (typically obtained from its parent range), applies a set of operations to it, and returns it. A range exposes the alloc_range and dealloc_range APIs.
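
As a mental model, here is a toy sketch only; the real ranges (such as the SmallBuddyRange and LargeObjectRange that appear below) operate on CapPtr-annotated pointers rather than raw void* and are composed with the Pipe template:

#include <cstddef>
#include <cstdlib>
#include <cstring>

// Toy parent range: hands out memory from somewhere (here, plain malloc).
struct ToyParentRange
{
  void* alloc_range(std::size_t size) { return std::malloc(size); }
  void dealloc_range(void* base, std::size_t) { std::free(base); }
};

// Toy layered range: applies its own policy (here, zeroing) on top of the
// memory obtained from its parent, mirroring the alloc_range/dealloc_range API.
template<typename ParentRange>
struct ToyZeroingRange
{
  ParentRange parent{};

  void* alloc_range(std::size_t size)
  {
    void* p = parent.alloc_range(size);
    if (p != nullptr)
      std::memset(p, 0, size);
    return p;
  }

  void dealloc_range(void* base, std::size_t size)
  {
    parent.dealloc_range(base, size);
  }
};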

LocalState contains information about the ranges used by a backend allocator. For example, the default LocalState is defined as:

    using LocalState = std::conditional_t<
      mitigations(metadata_protection),
      MetaProtectedRangeLocalState<Pal, Pagemap, Base>,
      StandardLocalState<Pal, Pagemap, Base>>;

And StandardLocalState exposes get_object_range and get_meta_range, so that user chunks are allocated through LargeObjectRange while metadata requests are allocated through ObjectRange (SmallBuddyRange then LargeObjectRange).

// ...
  private:
    using ObjectRange = Pipe<LargeObjectRange, SmallBuddyRange>;

    ObjectRange object_range;

  public:
    // Expose a global range for the initial allocation of meta-data.
    using GlobalMetaRange = Pipe<ObjectRange, GlobalRange>;

    /**
     * Where we turn for allocations of user chunks.
     *
     * Reach over the SmallBuddyRange that's at the near end of the ObjectRange
     * pipe, rather than having that range adapter dynamically branch to its
     * parent.
     */
    LargeObjectRange* get_object_range()
    {
      return object_range.template ancestor<LargeObjectRange>();
    }

    /**
     * The backend has its own need for small objects without using the
     * frontend allocators; this range manages those.
     */
    ObjectRange& get_meta_range()
    {
      // Use the object range to service meta-data requests.
      return object_range;
    }
// ...

Other concepts

Other concepts (I will dig into in the future):

  • Domestication: see capptr_domesticate. This concerns inter-thread memory allocation requests.
  • Authmap: this is used to enforce strict provenance in AALs that support strict provenance (currently only AALs with AAL_CHERI mixin).

Freelist and SeqSet structs

Snmalloc uses four iterable structures to link objects.

Freelist Builder

A builder can contain one or two freelist queues, depending on whether the random_preserve mitigation is enabled. As per the comments, it is used to build a freelist in object space (a toy sketch of the shape follows the list below).

  • What is the structure of a freelist builder?

    • head: pointer to the first element, aka a FreelistPtr
    • end: pointer to the next_object slot of the last element (i.e. a pointer to a FreelistPtr); when the list is empty it points to head
  • What uses a freelist builder?

    • FrontendSlabMetadata->free_queue keeps track of active Allocs under the slab
    • RemoteDeallocCache->lists is an array of Builder
  • How is it iterated?

    • Empty case: end == &head
    • First element: curr = head
    • Next element: curr.next_object
    • Last element: the element whose next_object slot is end (that element is included)
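
A toy model of that shape, ignoring the pointer obfuscation and CapPtr bounds of the real freelist::Builder:

#include <cassert>

// Toy model of the builder: an intrusive singly-linked list where `end`
// tracks the slot into which the next appended element will be written.
struct ToyObject
{
  ToyObject* next_object = nullptr;
};

struct ToyBuilder
{
  ToyObject* head = nullptr;
  ToyObject** end = &head; // empty case: end == &head

  void add(ToyObject* n)
  {
    n->next_object = nullptr;
    *end = n;              // link after the current last element
    end = &n->next_object; // the new tail's next slot becomes the end
  }
};

int main()
{
  ToyBuilder b;
  ToyObject a, c;
  b.add(&a);
  b.add(&c);

  // Iterate as described above: start at head, follow next_object.
  for (ToyObject* curr = b.head; curr != nullptr; curr = curr->next_object)
  {
    // visit curr
  }

  assert(b.end == &c.next_object); // end marks the last element's next slot
  return 0;
}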

Freelist Iter

Used to iterate a freelist in object space.

  • What is the structure of a freelist iter?

    • curr: freelist pointer, aka pointer to the first element
  • What uses a freelist iter?

    • LocalCache->small_fast_free_lists is a primitive array of freelist iters
  • How is it iterated?

    • Empty case: curr == nullptr
    • First element: curr
    • Next element: curr.next_object
    • Last element: right before the empty case (null terminator)

SeqSet (sequential set)

Doubly-linked cyclic list, linked through the T::node field. It is used to group slab metadata in metadata space (it is not used in object space); a toy sketch follows the list below.

  • What is the structure of a seqset?

    • head: a SeqSetNode (containing pointers next and prev)
  • What uses a seqset?

    • Core allocator laden groups FrontendSlabMetadata
    • Each core allocator alloc_classes entry contains available, which groups FrontendSlabMetadata
  • How is it iterated?

    • Empty case: head.next == &head
    • First element: curr = head.next
    • Next element: curr.next (or curr.prev)
    • Last element: right before the empty case
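
A toy model of that cyclic shape (illustration only; the real SeqSet links through a node embedded in the slab metadata):

#include <cassert>

// Toy model of a cyclic doubly-linked list with a sentinel head node.
struct ToySeqSetNode
{
  ToySeqSetNode* next;
  ToySeqSetNode* prev;
};

struct ToySeqSet
{
  ToySeqSetNode head{&head, &head}; // empty case: head.next == &head

  void insert(ToySeqSetNode* n)
  {
    n->next = head.next;
    n->prev = &head;
    head.next->prev = n;
    head.next = n;
  }
};

int main()
{
  ToySeqSet s;
  assert(s.head.next == &s.head); // empty

  ToySeqSetNode a{nullptr, nullptr};
  ToySeqSetNode b{nullptr, nullptr};
  s.insert(&a);
  s.insert(&b);

  // Iterate as described above: stop when we are back at the sentinel head.
  for (ToySeqSetNode* curr = s.head.next; curr != &s.head; curr = curr->next)
  {
    // visit curr
  }
  return 0;
}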

Remote deallocation queue

The remote deallocation queue uses a different iterable struct.

  • What is the structure of a remote deallocation queue?

    • back: freelist pointer
    • front: freelist pointer
    • stub: fake first entry
  • How is it iterated?

    • Empty case: front == &stub or back == nullptr
    • First element: curr = front
    • Next element: curr.next_object
    • Last element: right before curr == nullptr or curr == back (curr not included)

Security checks

The security checks are nicely grouped in the ds_core/mitigations.h header file. One future research question would be: how do these security mitigations affect the exploitability of memory safety bugs? One could test the added exploit constraints under different combinations of enabled security mitigations. A sketch of how these flags are queried in code follows the list of mitigations below.

  /**
   * Randomize the location of the pagemap within a larger address space
   * allocation.  The other pages in that allocation may fault if accessed, on
   * platforms that can efficiently express such configurations.
   *
   * This guards against adversarial attempts to access the pagemap.
   *
   * This is unnecessary on StrictProvenance architectures.
   */
  constexpr mitigation::type random_pagemap{1 << 0};
  /**
   * Ensure that every slab (especially slabs used for larger "small" size
   * classes) has a larger minimum number of objects and that a larger
   * percentage of objects in a slab must be free to awaken the slab.
   *
   * This should frustrate use-after-reallocation attacks by delaying reuse.
   * When combined with random_preserve, below, it additionally ensures that at
   * least some shuffling of free objects is possible, and, hence, that there
   * is at least some unpredictability of reuse.
   *
   * TODO: should this be split? mjp: Would require changing some thresholds.
   * The min waking count needs to be ensure we have enough objects on a slab,
   * hence is related to the min count on a slab.  Currently we without this, we
   * have min count of slab of 16, and a min waking count with this enabled
   * of 32. So we would leak memory.
   */
  constexpr mitigation::type random_larger_thresholds{1 << 1};
  /**
   *
   * Obfuscate forward-edge pointers in intra-slab free lists.
   *
   * This helps prevent a UAF write from re-pointing the free list arbitrarily,
   * as the de-obfuscation of a corrupted pointer will generate a wild address.
   *
   * This is not available on StrictProvenance architectures.
   */
  constexpr mitigation::type freelist_forward_edge{1 << 2};
  /**
   * Store obfuscated backward-edge addresses in intra-slab free lists.
   *
   * Ordinarily, these lists are singly-linked.  Storing backward-edges allows
   * the allocator to verify the well-formedness of the links and, importantly,
   * the acyclicity of the list itself.  These backward-edges are also
   * obfuscated in an attempt to frustrate an attacker armed with UAF
   * attempting to construct a new well-formed list.
   *
   * Because the backward-edges are not traversed, this is available on
   * StrictProvenance architectures, unlike freelist_forward_edge.
   *
   * This is required to detect double frees as it will break the doubly linked
   * nature of the free list.
   */
  constexpr mitigation::type freelist_backward_edge{1 << 3};
  /**
   * When de-purposing a slab (releasing its address space for reuse at a
   * different size class or allocation), walk the free list and validate the
   * domestication of all nodes along it.
   *
   * If freelist_forward_edge is also enabled, this will probe the
   * domestication status of the de-obfuscated pointers before traversal.
   * Each of domestication and traversal may probabilistically catch UAF
   * corruption of the free list.
   *
   * If freelist_backward_edge is also enabled, this will verify the integrity
   * of the free list links.
   *
   * This gives the allocator "one last chance" to catch UAF corruption of a
   * slab's free list before the slab is de-purposed.
   *
   * This is required to comprehensively detect double free.
   */
  constexpr mitigation::type freelist_teardown_validate{1 << 4};
  /**
   * When initializing a slab, shuffle its free list.
   *
   * This guards against attackers relying on object-adjacency or address-reuse
   * properties of the allocation stream.
   */
  constexpr mitigation::type random_initial{1 << 5};
  /**
   * When a slab is operating, randomly assign freed objects to one of two
   * intra-slab free lists.  When selecting a slab's free list for allocations,
   * select the longer of the two.
   *
   * This guards against attackers relying on object-adjacency or address-reuse
   * properties of the allocation stream.
   */
  constexpr mitigation::type random_preserve{1 << 6};
  /**
   * Randomly introduce another slab for a given size-class, rather than use
   * the last available to an allocator.
   *
   * This guards against attackers relying on address-reuse, especially in the
   * pathological case of a size-class having only one slab with free entries.
   */
  constexpr mitigation::type random_extra_slab{1 << 7};
  /**
   * Use a LIFO queue, rather than a stack, of slabs with free elements.
   *
   * This generally increases the time between address reuse.
   */
  constexpr mitigation::type reuse_LIFO{1 << 8};
  /**
   * This performs a variety of inexpensive "sanity" tests throughout the
   * allocator:
   *
   * - Requests to free objects must
   *   - not be interior pointers
   *   - be of allocated address space
   * - Requests to free objects which also specify the size must specify a size
   *   that agrees with the current allocation.
   *
   * This guards gainst various forms of client misbehavior.
   *
   * TODO: Should this be split? mjp: It could, but let's not do this until
   * we have performance numbers to see what this costs.
   */
  constexpr mitigation::type sanity_checks{1 << 9};
  /**
   * On CHERI, perform a series of well-formedness tests on capabilities given
   * when requesting to free an object.
   */
  constexpr mitigation::type cheri_checks{1 << 10};
  /**
   * Erase intra-slab free list metadata before completing an allocation.
   *
   * This mitigates information disclosure.
   */
  constexpr mitigation::type clear_meta{1 << 11};
  /**
   * Protect meta data blocks by allocating separate from chunks for
   * user allocations. This involves leaving gaps in address space.
   * This is less efficient, so should only be applied for the checked
   * build.
   */
  constexpr mitigation::type metadata_protection{1 << 12};
  /**
   * If this mitigation is enabled, then Pal implementations should provide
   * exceptions/segfaults if accesses do not obey the
   *  - using
   *  - using_readonly
   *  - not_using
   * model.
   */
  static constexpr mitigation::type pal_enforce_access{1 << 13};
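
These flags are queried through the mitigations(...) predicate that already appeared in earlier snippets (freelist::Builder<mitigations(random_preserve)> and the metadata_protection check in the LocalState selection). A sketch of that usage, under the assumption that the names resolve as in those snippets once the snmalloc headers are included:

#include <snmalloc/snmalloc.h>

// Sketch only: compile-time query of the enabled mitigation set, mirroring
// the mitigations(...) usage visible in the quoted source above.
constexpr bool double_free_detection_enabled =
  snmalloc::mitigations(snmalloc::freelist_backward_edge) &&
  snmalloc::mitigations(snmalloc::freelist_teardown_validate);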

GEF plugin for snmalloc

I have developed a GEF plugin that adds a heap manager that understands the snmalloc heap. It is limited to the default configuration and the default set of security mitigations. You can read more in this other blog post or check it out.

Note: the plugin requires GEF with CHERI support because it assumes a separation between address space size (adrsize) and pointer size (ptrsize). The original GEF assumes adrsize == ptrsize, but this is not true for CHERI-enabled architectures.