Custom Memory Allocator

Building a vector using static memory.

When Leetcoding, I always have to make decision on whether to use std::vector or std::array. For starters, std::vector uses heap (dynamic memory allocation) while std::array uses stack (static memory allocation). Accessing static memory is significantly faster than accessing dynamic memory. The drawback is that static memory needs to be pre-initialized, and there is a limit to how much stack memory we can use.

“When to use std::vector vs std::array in LeetCode?” is a question for another post. – The short answer is, we prefer std::array whenever we would want to use a vector and storing a “small” amount of data.

For this post, we focus on how to build a custom memory allocator.

With a custom allocator, we will have a std::vector with automatic balancing between stack and heap memory allocation so that I can always use std::vector’s interface without having to work with preallocated std::array.

std::vector always uses heap allocation by default.

See the complete code example for this section here.

Take a look at this example, how many bytes have we allocated and deallocated in total?

std::vector<int> v; //0
v.push_back(1); //1
v.push_back(1); //2
v.push_back(1); //3
v.push_back(1); //4
v.push_back(1); //5
v.push_back(1); //6

At a first glance, we created a vector and pushed 6 ints into it. It should be 6*sizeof(int)==24 bytes right?

  • No, the exact allocation is more complicated because of the vector resize behavior. However, if call vector::reserve(6) at the beginning we can avoid the resizing at push_back and find that we allocated exactly 24 bytes.
    • Assuming the vector itself is created on the stack, so line 0 should takes 0 allocation.
    • By default, v has 0 capacity, so vector has to make space for an int on line 1, which is sizeof(int)==4 bytes.
    • After line 2, we’ll need 8 bytes to store 2 int’s so resize kicks in and expanded the size to be 2. – vector::resize will first create a new container of size 4 bytes than copy the original 2 bytes in there, and destroy the original 2 bytes. Hence, an allocation of 4 bytes and a deallocation happens here.
    • Exactly how much the vector resize expands its capacity depends on its resize factor, which is 2 for the implementations I’m testing.
    • I’ll skip the explanation for line 3-6 here as they behave accordingly.
      v.push_back(1); //1  - allocate 4 bytes
      v.push_back(1); //2  - resize: deallocate 4 bytes and allocate 8 bytes
      v.push_back(1); //3  - resize: deallocate 8 bytes and allocate 16 bytes
      v.push_back(1); //4  - 4 ints (16 bytes) can fit in 16 bytes, nothing happens
      v.push_back(1); //5  - resize: deallocate 16 bytes, and allocate 32 bytes
      v.push_back(1); //6  - 6 ints (24 bytes) can fit in 32 bytes, nothing happens
      

      To observe the exact allocation, we can override the global new() and delete() operator like this:

      void * operator new(size_t size)
      {
          cout<< "Allocating " << size << "bytes" << endl;
          void * p = malloc(size);
          return p;
      }
      void operator delete(void * p)
      {
          cout<< "Deallocating" << endl;
          free(p);
      }
      

Creating a custom allocator.

See the complete code example for this section here.

To create a custom allocator, we need to satisfy the Allocator Completeness requirements.

The most minimum functions we need to provide for a template T are:

  • allocate -> T*
  • deallocate -> void
  • copy, copy-assign, move, move-assign constructors do not throw.
  • operator== and operator!=

Obviously, the most important parts are allocate/deallocate. In my example of ShortAlloc, we called allocate/deallocate functions from Arena which manages the memory resource.

  • Arena maintains a internal stack buffer_ a plain old byte array allocated on the stack.
    • a pointer ptr_ to the internal stack to note where to allocate new objects into
    • If the stack memory runs out, we simply use the global new operator to allocate from heap. All the buzz about alignment is absolutely required, but they just try to align the allocation so that each new allocation is aligned to std::max_align_t for faster cpu cache access.

Prefer std::pmr::polymorphic_allocator over custom allocator mentioned above.

See the complete code example for this section here.

ShortAlloc and Arena can be used for custom memory allocation because they provide the functions required by Allocator Completeness requirements. The drawback to this – the custom allocator type name becomes part of the vector using it, so we cannot easily use it without knowing the exact allocator type at compile time.

  • e.g. if a functions takes in a std::vector as input, we cannot pass an input of type vector<DataType, ShortAlloc<...>> to it as std::vector with default allocator is a different type.
  • The solution is to use std::pmr::vector, which achieves run time polymorphism by allowing std::pmr::vector using different allocators.
    To create a custom type of std::pmr::memory_resource, we implement these interfaces in our implementation MyStackResource
    void* do_allocate(std::size_t bytes, std::size_t alignment) override;
    void do_deallocate(void *p, size_t bytes, size_t alignment) override;
    bool do_is_equal(const memory_resource &other) const noexcept override;
    

    The implementation detail of MyStackResource is almost the same as that of Arena’s implementation, except for that this time we are aligning objects with it’s actual size instead of std::max_align_t.

Testing performance.

See here to compile and run my entire test.

As mentioned in the Notes section, performance is NOT a most important reason for using custom allocator. – I wanted to create a test scenario to illustrate the point.

Here’s the scenario,

We want to maintain a cache of MANY vectors and each of them is guaranteed to have a SMALL size.

With my custom allocators that uses stack memory, we can avoid having to allocate to heap at runtime, as long as all the vectors and their data can fit into the stack memory.

Here are the results for creating 1562500 vectors with size of 5.

Allocator Types Init Iterate Destruct
ShortAlloc 1 ms 1 ms 0 ms
MyStackResource (pmr) 2 ms 1 ms 0 ms
std default heap 39 ms 2 ms 25 ms
  • Init is measuring initializing all the vectors.
  • Iterate is looping through each element in each vector.
  • Destruct is destroying the vectors when scope ends. Obviously, the std heap allocation had much overhead over static allocation, but again, the scenario is not that realistic.

Notes

  • While researching for this blog, I found that performance is not the most important reason for using custom memory allocator. Taming dynamic memory - An introduction to custom allocators in C++ - Andreas Weis - code::dive 2018
    • We use custom memory allocators to avoid the common problems of default std::allocator like:
      • Memory coalescing
      • Memory fragmentation
      • Shared global state – all allocation uses the same allocator.
        • e.g. a potential lock on a global allocator.
    • For performance reasons, we have to go case by case to evaluate whether a custom allocator (or custom pmr::memory_resource) is indeed needed.
  • My example of a “Short Vector” is a solution looking for a problem. From this discussion, there are existing non-standard solution like boost::container::small_vector that works better if you really need a Small Vector Optimization solution.

© 2022. All rights reserved.