Dynamic embedded memory

18th Mar 2022 software algorithm c++ embedded

Dynamic embedded memory management

When you have an embedded system with limited memory, you are forced to think twice about how you are going to spend your budget. The most obvious solution is to avoid dynamic memory allocation and deal with static pool of memory you are managing yourself completely.

But sometimes, you want to use some existing code that makes use of the heap (either via malloc or new). In that case, you must be aware of the pitfalls for those.

This post tries to explain and gives some advice to optimize and minimize your memory impact. It's written for very limited memory system (those with less or few megabytes of memory, not gigabytes). I'm using ESP32 as a example platform, but the advice written below will apply for many different platform.

Heap vs Stack vs fixed storage

On actual CPU architecture, there are three kind of memory storage you can make use of:

fixed storage set up at compilation
stack space
heap space

Fixed storage

When you write a software for the embedded system, you have to describe the memory layout of the system to the linker so it place the executable code at some specific address. But it also place your static variables in some specific space (many developers call them .data section, but it depend on the architecture). Usually, this step is done for you by the IDE you are using. There are 3 kinds of data you can use this way:

read only data (typically constants, declared with const keyword in C/C++). This includes strings like "blue" and many other automagically written tables (like virtual tables in C++, switch/case jump tables and so on). These data usually ends up in read only data sections. There can be multiple sections that the linker will concatenate for the final firmware building.
read-write data with initialization (typically, like char lut[8] = {0, 1, 2, 3, 4, 5, 6, 7}; in C/C++ or simply a global int a = 34;). There data requires 2 steps to be usable from your program. They need to have storage allocated for them and they need to have initialization code to be called to fill them. This is done for you by the linker that creates 2 entries in the .data section and initialization section.
read-write data with no initialization (like char buffer[1024]; in C/C++). In that case, the linker will reserve some memory space in the memory layout for these variables, but will not emit any entry in the data section. Instead, it will create some entries in the .bss section to say "hey, there's 1024 bytes required for this variable, initialize it when you start". It does not consume any binary space, only RAM space and require cooperation from the bootloader to create/zero the space before the control is given to your program.

In C++, you'll find the std::array structure as the best tool to deal with such buffer. The biggest drawback from these buffers is that they can not resize (either smaller or larger). They also consume memory even if not used. A buffer in some obscure function that's only called upon a very rare error case will eat your memory layout and you can't claim it back.

On ESP32, you have many tools to list them, one is idf.py size-components.

On POSIX system, you can also use BloatyMcFace (or objdump -tT) to analyze a software for each symbol eating this space

Stack space

The stack is a memory that's set up before the control is given to your function. It's fixed size and consumed by functions to store their local variables and return address. You can allocate dynamically from this memory with functions like alloca (in C) or StackHeapBuffer (in C++) see here, but you risk overflowing the memory area (and you don't know how much memory is left when the control is reaching your code). So such usage is very dangerous unless you allocate plenty of stack space.

Allocating in the stack is almost free, since it implies adding your allocation size to the stack pointer. De-allocation is done automatically by the compiler when the control leaves the running function (by subtracting from the stack pointer). So this means no long term storage of your data here.

In order to estimate the required stack space, you can use -fstack-usage GCC's compiler option to have the compiler logs the stack usage for each functions. Yet, it's a PITA to deal with.

Heap space

While not always present, the heap is pool of memory that's often created from the memory left after every other consumer served on the memory layout following her needs. It's managed with some algorithm with a common interface:

malloc memory allocation. Request a buffer of the given size. If not possible returns null.
free memory de-allocation. Tell the allocator that this buffer is no more required. The returned buffer must have been allocated with malloc
realloc memory re-allocation. Ask to change the memory size of the given buffer to a new size. The returned buffer can be the same as the previous buffer, or a new buffer. In that case, the content of the buffer is copied to the new buffer. If not possible returns null, but does not de-allocate the given buffer.

The heap is used for permanent memory storage. It's not cleaned when leaving functions. Management of such memory is done manually (in C) or automatically if you use RAII (resource acquisition is initialization) in C++.

There are multiple drawback to the heap:

The algorithm to allocate and de-allocate is usually non deterministic (not true on ESP32). It means that the processing time is random and the result is not guaranteed.
The algorithm's memory pool will fragment over time. It means that even if you are using only 60% of your heap space, there might be cases were you can't allocate more
No automatic cleaning of the allocations. It can leak if your program does not return the memory to the heap.
The algorithm itself consume heap memory to manage the allocations. This is not negligible on small heap like few kB.

Fragmentation is probably the biggest issue because it's a creeping issue. For example, let's say you have 2 objects (A, B) of different size and allocate them in this order:

Allocate A [ AAAA ...... ]
Allocate B [ AAAABBB ..... ]
Allocate A [ AAAABBBAAAA ..... ]
Free first A [ ....BBBAAAA ..... ]
Allocate B [ BBB.BBBAAAA .... ]
Allocate A, it can't fit between the 2 B instance: [ BBB.BBBAAAAAAAA .... ]
The space here between both B instance is wasted and is not recoverable until both B instance are de-allocated.
Even in that case, you'll only be able to fit one A instance in the free space, so you'll still have wasted space [AAAA...AAAAAAAA .... ]

As you see, if you have some object with a long life, it'll cause fragmentation in the heap. Other time, this will always increase until the heap is unable to allocate anymore. We'll see below how to solve this (or, at least limit it as much as possible).

Strategies to get a better usage of the heap

Limit heap usage

The first strategy on embedded system is to limit heap usage to the maximum. This sounds obvious, but there are multiple programming paradigm to use for this.

For example, in C++ (or C, but it's harder), you'll prefer using view instead of memory managed objects.

Typically, try to use std::string_view instead of std::string. The former is a read-only string, and doesn't allocate any memory, only store a pointer on the string head and the view size. The latter copies the data around, allocating memory to store the copy. Worst, the allocation is done with a exponential growth factor (which makes sense on a PC but not on an embedded system).

It means that if your string is 257 bytes long, it'll consume 512 bytes of memory because who knows, the user might need it later.

The same applies for std::vector and other containers. If you can't use std::array, then track all usage of std::vector::push_back, std::copy and make sure there's a std::vector::reserve before for the new final expected size. You can also call std::vector::shrink_to_fit after many vector insertion/erase operations to reduce the heap usage.

Change your standard library

If you know beforehand the expected size of your container in C++, you might be interested by ETL library that's almost a 1:1 equivalent to the STL not using the heap.

You can also use vector class that's not over allocating like this one. The trade off is to spend more CPU time doing allocations and overloading the heap allocator itself with many more small allocations. As usual YMMV.

Change the allocator itself

In C++, it's almost impossible to change the allocator for a small part of the code without a major rewrite of the C++ code itself. This is due to the ABI only allowing a single global new operator (and not namespace based operator new/delete). Yet, there are few way to still perform this, although a bit hacky:

enum class AllocatorMode {
    Default,
    Custom,
};

thread_local AllocatorMode currentMode;

void * operator new(size_t n) {
    switch(currentMode) {
        case Custom: return customAllocator(n);
        case Default: return ::malloc(n);
    }
}

void * operator new[](size_t n) { return operator new(n); }
void * operator delete(void * p) { 
    switch(currentMode) {
        case Custom: return customDeallocator(n);
        case Default: return ::free(n);
    }
}         
void * operator delete[](size_t n) { return operator delete(n); }

void setAllocatorMode(const AllocatorMode mode) { currentMode = mode; }

// Simple RAII class to swap the allocator while the instance exists
struct CustomAllocator
{
    CustomAllocator() { currentMode = Custom; }
    ~CustomAllocator() { currentMode = Default; }
};

Then in the code that must use the new allocator, you'll do this:

void someFunc() {
     CustomAllocator custom; // Set the allocator to use until this scope exits
     // call your library
     myLibrary.someMethod();
}

Here the allocator to use depends on a thread local variable, so as long as you call your library code within the same thread, it'll work.

Why do you want to change allocator ?

As shown above, when run continuously, your program will experience out of memory situation, due to heap fragmentation. I've yet to see any valid workaround for this case. All techniques I've experienced or learned about all fail either because they are unreliable (there is just too many code using the heap that it's almost impossible to write 100% reliable code in no-heap conditions), or impractical (writing a code that's 100% exception free takes twice the binary space and would limit to usage of libraries written the same way, which, AFAIK, does not exist down to the OS level).

For example, in C++, out of memory is expressed by an exception. You can't log this condition since writing to a stream uses the heap and you can't trigger an exception within the processing of an exception (this leads to std::terminate AKA game over).

Also, enabling exception handling in embedded development is not usually advised (due to the code bloat of exception unwinding). If your library make use of C code, and this C code uses setjmp/longjmp, then you're out of luck anyway.

So how to solve this ?

The nuclear solution is to run your code in a thread with its own allocator (an allocator that's managing a fixed size memory pool, not the system's heap). Upon out of memory event, you'll kill the thread and reset the allocator to its pristine state, then restart the thread. This only works if your code has no side effects (like managing a network state, or some hardware device). In that case, you'll need to register handlers that can be called upon the thread termination (and that don't consume new heap).

If the effect of restarting the thread is too painful to the user (compared to rebooting the system), then you'll have to store a state of the application in a fixed area and bootstrap from this state when the thread is starting

Previous Post Next Post