Implementing a cycle-precise delay loop in C++ for embedded systems

While working recently on Marlin's printer firmware, I found a pending pull request about waiting for precise, ideally nanosecond-timed, delays.

While it looked easy at first, the implementation proved tricky. This post explains the difficulties and the implementation details.

Numerous platforms

Marlin is a firmware that targets numerous platforms (from AVR CPUs to ARM Cortex M0, Cortex M3, Cortex M4, Cortex M7, ESP32, AMD64...). Implementing a common delay loop for all these platforms is almost impossible.

So I focused on ARM platforms only.

Even in that case, the number of platforms implies many different timing behaviors, due to:

  1. Inconsistent cycle counts depending on the platform (while the Cortex M0 and M0+ have a fixed cycle count per instruction, the Cortex M3/M4/M7 are pipelined and branch-predicted)
  2. Actual hardware inconsistencies (STM32 chips fetch their instructions directly from Flash, with a 5-cycle penalty, unless they are loaded in the instruction cache)
  3. Presence of a hardware cycle counter (not required on Cortex M0)

Here are the links to compute the cycle count for a bare Cortex CPU. Actual implementations will differ from these (because of Flash or external RAM instruction-fetch penalties, etc.):

  • Cortex M0
  • Cortex M3
  • Cortex M4
  • Cortex M7: it seems that if the assembly is in Thumb mode, the cycle count should be more or less the same as on the M4. Otherwise, it's not possible to compute the cycle count of instructions.

Description of an ideal solution

So, let's jump to the Wonderprinter universe for a second. In this universe, the Rainbow Lantern's first wish was to have a delay of exactly 182ns before being enabled, then a delay of exactly 34ns before being disabled. In this universe, the physics are simple: the Big CPU Master executes exactly one instruction per cycle and can run at infinite clock speed without consuming a single watt of power.

The evil Entropyman is sadly lurking around and is attracted to innocent CPUs going too fast like fruit flies are attracted to marmalade. How fast should the Big CPU Master go to fulfill the Rainbow Lantern's wish without attracting Entropyman?

The obvious answer is that the clock period must be equal to the GCD of the two delays (in this world, it's 2ns). So the Big CPU Master must clock at 1/2ns = 500MHz.

He'll then count 91 cycles before enabling the Rainbow Lantern and then 17 cycles before disabling it.

Unfortunately, the evil Entropyman is very sensitive to such high frequencies, and, for safety reasons (I don't want to expose any kids reading this story to violence and horror pictures), anything above 84 MHz will attract the Bad Guy as surely as sugar attracts ants and cavities.

With such a low frequency, each CPU cycle takes 1/84MHz, or 11.9ns. This means the Rainbow Lantern will have to accept a small deviation from its wishes, since the best that can be done is 178.5ns for the high time (15 cycles) and 35.7ns for the low time (3 cycles).

Yet, yet...

The Big CPU Master does not know the wishes of the Rainbow Lantern in advance. When the Rainbow Lantern decides on a new wish, the Big CPU Master has to compute how many cycles to wait, and he has to interrupt what he was doing to fulfill the lovely lantern's dreams. This also takes some cycles.

In this universe, the Big CPU Master can ask little dwarfs to compute the number of cycles for him, so he doesn't spend his precious time performing the boring computations. Yet, those pesky creatures refuse to deal with spoken values; they only accept values written in a book, or else they go on strike (they have half French blood, half magical blood).

They say that a spoken value can change while the flux is streaming, while a written value never changes.

The Big CPU Master is reluctant to accept such a condition entirely, so he decides to accept the solution partially. If the value is written in a book, a little dwarf will precompute it. If it's not, he'll take some time to compute it himself. Since he knows how long such a calculation takes, he then asks the Rainbow Lantern to never tell him a spoken wish lower than his calculation time.

Yet, yet...

The Big CPU Master is getting old by now, and interrupting his work takes more time than it did in his childhood. He now has to close the book he was working on, open another book to count cycles, and then close it once he's done... to reopen the previous book. He wonders if some mystical fairy would be able to spare him this switching delay. It turns out that Mz Inline Intrinsics knows an old magical spell that does exactly that.

Back to our universe...

Technical description of the need

We want a solution that'll:

  1. Wait for the cycle count closest to the requested delay
  2. Not wait longer than the requested delay (for example, if the CPU is interrupted, it should not linger)
  3. Be able to wait from as little as one cycle up to a few microseconds (if you need more, don't attract Entropyman: you'd better use a hardware timer for that)
  4. Fall back to a hardware timer whenever possible
  5. Ideally require no hand-tweaking of the code for each platform
  6. Ideally require no manual calibration of each platform and no database of cycles per instruction
  7. Be easy to maintain

In language terms, we would like the compiler to write this (pseudo-assembly):

main:
   [previous_code()]
   nop
   nop
   nop
   nop
   nop
   [some_other_thing()]
   call cycle_wait(210)

from source code like this:

int main() {
    previous_code();
    DELAY_CYCLES(5); // Or DELAY_NS(65)
    some_other_thing();
    DELAY_CYCLES(210);
}   
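
As a side note, a DELAY_NS macro has to translate nanoseconds into the cycle count closest to the requested delay (requirement 1 above). Here is a minimal sketch of that conversion; the F_CPU value and the ns_to_cycles helper are illustrative assumptions, not Marlin's exact code:

#include <cstdint>

// Illustrative only: F_CPU (the CPU clock in Hz) is assumed to come from the
// build system; 84 MHz matches the example frequency used in this post.
#define F_CPU 84000000ULL

// Round to the *nearest* cycle instead of truncating, so the delay stays as
// close as possible to the requested duration.
constexpr uint32_t ns_to_cycles(uint64_t ns) {
  return static_cast<uint32_t>((ns * F_CPU + 500000000ULL) / 1000000000ULL);
}

static_assert(ns_to_cycles(65) == 5, "65ns at 84MHz is closest to 5 cycles");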

Solution for this puzzle

Because Marlin is written in C++, our little dwarf is sleeping in the compiler, waiting to perform computations at compile time.

When the code calls a DELAY function, there are two possibilities: either the argument of the function is known at compile time, or it's a variable that will only be known at runtime. In the latter case, we'll jump to a function that simply counts cycles. Ideally, we should know how many cycles are wasted jumping to and from this function and subtract that from the function's count.
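
To give an idea of that runtime path, here is a minimal sketch for Cortex M3/M4/M7, which expose a hardware cycle counter (DWT CYCCNT); the overhead constant is an illustrative placeholder for the value that calibration (described below) would provide:

#include <cstdint>

// DWT->CYCCNT lives at 0xE0001004 on Cortex M3/M4/M7; it must have been
// enabled at boot (TRCENA in DEMCR, CYCCNTENA in DWT_CTRL).
static inline uint32_t cycle_count() {
  return *reinterpret_cast<volatile uint32_t*>(0xE0001004u);
}

void delay_cycles_runtime(uint32_t cycles) {
  constexpr uint32_t kCallOverhead = 40;   // illustrative: cycles burnt getting here and back
  const uint32_t start = cycle_count();
  if (cycles <= kCallOverhead) return;     // the call alone already used the budget
  const uint32_t target = cycles - kCallOverhead;
  while (cycle_count() - start < target) { /* spin */ }
}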

In the former case, we need our little dwarf to compute the number of cycles to wait for and to grudgingly insert that many sleepy cycles into the current instruction flow. We don't want him to call another function, since that would add unwanted cycles to go there and come back. Yet, we still have to monitor him, because sometimes he's dumb and we don't want him to insert too many sleepy cycles. Instead, we'll ask him to replace all those cycles with a call to a loop function when the count of sleepy cycles is greater than the cost of calling the loop function.

Wait...

You said that each platform is different, so how do you deal with such differences?

Calibration is the word. Upon boot, the system uses an (imprecise) timer to measure a large cycle-count delay. Once this is done, figuring out the platform's oddities is a simple mathematical problem, and the result is stored in a static variable. Let the printer do the math for us, so we don't have to maintain a database of values that might change with the compiler version or optimization level.
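
To illustrate, a boot-time calibration could look roughly like this (a hedged sketch, not Marlin's actual code; micros() and the probe sizes are assumptions):

#include <cstdint>

extern uint32_t micros();                    // imprecise microsecond timer, assumed available
extern void delay_cycles_runtime(uint32_t);  // the runtime loop sketched earlier

// Correction factor figured out at boot and stored in a static variable.
static uint32_t measured_cycles_per_request = 1000;

void calibrate_delay(uint32_t cpu_freq_hz) {
  const uint32_t start = micros();
  for (int i = 0; i < 1000; ++i)
    delay_cycles_runtime(1000);              // request 1,000,000 cycles in total
  const uint32_t elapsed_us = micros() - start;
  // Convert the measured duration back into the cycles actually spent per request.
  measured_cycles_per_request =
      static_cast<uint32_t>((uint64_t)elapsed_us * cpu_freq_hz / 1000000ull / 1000ull);
}

The runtime loop (and the call overhead it subtracts) can then be corrected with this measured value.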

Umphhh... How do you write a "sleepy cycle" in C++ based on a compile-time value?

There is a low-quality trick and a high-quality trick. The low-quality trick is to write a bunch of switch cases like this:

switch(cycle_count) {
    case 10: sleepy_cycle(); // No break here
    case  9: sleepy_cycle();
        ...
    case  1: sleepy_cycle();
    case  0: break;
}

If the compiler isn't too stupid, it'll generate an array of sleepy_cycle instructions and jump into the middle of that array based on the requested cycle_count. It does.

Yet, this solution is not much fun, because you must write as many switch cases as you expect the loop function's overhead to take (on my printer, it's ~40 cycles, but it's different for every printer). Let's say you get a shiny new printer with a 4THz CPU that needs 400 cycles to call a function? Right, you need to hand-update your code or it'll break again. In addition, this is not a function-level construct: it must be written inside a function, so you'll still have the overhead of calling that function.

The high-quality trick, however, is much better and less verbose:

template <int N> struct SleepCycleWriter 
{
  inline static void sleep() { sleepy_cycle(); SleepCycleWriter<N-1>::sleep(); }
};

// End the recursion with an empty function
template <> struct SleepCycleWriter<0> { inline static void sleep() {} };

Here, anytime you instantiate SleepCycleWriter<N>::sleep(), the compiler recursively emits N calls to sleepy_cycle(). You need 400 cycles? No problem.

No function-calling overhead whatsoever, since all the sleep() functions are inline.

Ok, everything seems fine now. But how do you know whether a value is compile-time or runtime? And how do you act differently in each case?

In C++ (the 2014 version that Marlin is using), it's not possible to react differently to compile-time and runtime values... unless you use a bunch of non-standard tricks. This is because a compile-time value can be used as a template parameter, but a runtime variable cannot. This is invalid C++:

int x = 10; SleepCycleWriter<x>::sleep();

So, short of writing an ugly:

switch (cycle_count) { case 10: SleepCycleWriter<10>::sleep(); break; ... }

There is no standard optimal solution (even in C++20... yet).

To solve this issue, I'm relying on a compiler extension called __builtin_constant_p. This magical builtin checks whether the given parameter is a compile-time value (a constant expression, in C++ guru terms) but does not fail if it's not (unlike if constexpr in C++17 and later).

It also provides a nice advantage when used like this:

constexpr int o = __builtin_constant_p(count) ? count : 0;

In that specific case, I'm building a constant expression (a compile-time value) from a possibly runtime value: it's the given count if count is a compile-time value, or the constant 0 otherwise.

The magic is in the ? count part here: if count is not a compile-time value, that operand is not evaluated and so does not generate a compile-time error (unlike an if constexpr statement).

After that, everything is pretty straightforward: o is a compile-time value that can be used as a template parameter. We just need to specialize some template on this parameter so that if it's 0 (or -1, or whatever; it just needs to be consistent), we fall back to a runtime call of the loop function.
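
Putting the pieces together, here is a hedged sketch of such a dispatch (the CycleDelay helper, the -1 sentinel, kLoopOverhead and the nop-based sleepy_cycle() are illustrative choices, not Marlin's exact code, and the constexpr trick relies on the GCC extension described above):

#include <cstdint>

void delay_cycles_runtime(uint32_t count);   // the runtime counting loop sketched earlier

inline void sleepy_cycle() { __asm__ volatile ("nop"); }

template <int N> struct SleepCycleWriter {
  static inline void sleep() { sleepy_cycle(); SleepCycleWriter<N-1>::sleep(); }
};
template <> struct SleepCycleWriter<0> { static inline void sleep() {} };

constexpr int kLoopOverhead = 40;            // illustrative cost of calling the runtime loop

// Compile-time count: emit the nops inline, unless the count is large enough
// that calling the runtime loop is cheaper.
template <int N> struct CycleDelay {
  static inline void run(uint32_t) {
    if (N > kLoopOverhead) delay_cycles_runtime(N);       // folded away when N is small
    else SleepCycleWriter<(N > kLoopOverhead ? 0 : N)>::sleep();
  }
};

// Sentinel specialization: the count was not a constant expression, so count
// the cycles at runtime.
template <> struct CycleDelay<-1> {
  static inline void run(uint32_t count) { delay_cycles_runtime(count); }
};

#define DELAY_CYCLES(count) do {                                                    \
    constexpr int _kCount = __builtin_constant_p(count) ? static_cast<int>(count) : -1; \
    CycleDelay<_kCount>::run(count);                                                \
  } while (0)

With this, DELAY_CYCLES(5) expands into five inline nops, DELAY_CYCLES(500) collapses into a single call to the runtime loop, and DELAY_CYCLES(some_runtime_variable) always goes through the runtime loop.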

Results

Well, everything boils down to results anyway. Why bother with all this mess if it does not give excellent results?

In fact, it does:

Calling the DELAY_CYCLES macro directly for 1cycles took: 1cycles
Calling the DELAY_CYCLES macro directly for 5cycles took: 5cycles
Calling the DELAY_CYCLES macro directly for 10cycles took: 11cycles
Calling the DELAY_CYCLES macro directly for 20cycles took: 20cycles
Calling the DELAY_CYCLES macro directly for 50cycles took: 49cycles
Calling the DELAY_CYCLES macro directly for 100cycles took: 102cycles
Calling the DELAY_CYCLES macro directly for 200cycles took: 207cycles
Calling the DELAY_CYCLES macro directly for 500cycles took: 497cycles

About one cycle of error for small delays and at most 7 cycles of error for large delays.

You can see for yourself in this godbolt.
