While working on Marlin's printer firmware recently, I found a pending pull request about precise, ideally nanosecond-accurate, delays.
While it looked easy at first, the implementation proved tricky. This post explains the difficulties and the implementation details.
Marlin is a firmware that targets numerous platforms (from AVR CPUs to ARM Cortex-M0, Cortex-M3, Cortex-M4, Cortex-M7, ESP32, AMD64...). Implementing a common delay loop for all these platforms is almost impossible.
So I focused on ARM platforms only.
Even then, the number of platforms implies many different timing behaviors, due to:
Here are the links to compute the cycle count for a bare Cortex CPU. Actual implementations will differ from these figures (because of Flash or external RAM instruction-fetch penalties, etc.):
So, let's jump to the Wonderprinter universe for a second. In this universe, the Rainbow Lantern's first wish was to have a delay of exactly 182ns before being enabled, then a delay of exactly 34ns before being disabled. In this universe, physics is simple: the Big CPU Master executes exactly one instruction per cycle, and it can run at infinite clock speed without consuming a single watt of power.
The evil Entropyman is sadly lurking around, attracted to innocent CPUs going too fast like fruit flies are attracted to marmalade. How fast should the Big CPU Master go to fulfill the Rainbow Lantern's wish without attracting Entropyman?
The obvious answer is that the clock period must be equal to the GCD of the two delays (in this world, 2ns). So the Big CPU Master must clock at 1/(2ns) = 500MHz.
It'll then count 91 cycles before enabling the Rainbow Lantern and then 17 cycles before disabling it.
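The arithmetic of this little story can be sketched in C++ (hypothetical helpers of mine, using C++17's std::gcd for brevity):

```cpp
#include <cstdint>
#include <numeric>  // std::gcd (C++17)

// Smallest clock period (in ns) that expresses both delays exactly:
// the greatest common divisor of the two delays.
constexpr int64_t required_period_ns(int64_t high_ns, int64_t low_ns) {
    return std::gcd(high_ns, low_ns);
}

// Number of whole cycles to wait for a given delay at that period.
constexpr int64_t cycles_for(int64_t delay_ns, int64_t period_ns) {
    return delay_ns / period_ns;
}
```

For 182ns and 34ns this yields a 2ns period (500MHz), with 91 and 17 cycles respectively.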
Unfortunately, the evil Entropyman is very sensitive to such high frequencies, and, for safety reasons (I don't want to expose any kids reading this story to violence and horror), anything above 84 MHz will attract the Bad Guy as surely as sugar attracts ants and cavities.
With such a low frequency, each CPU cycle takes 1/84MHz, or 11.9ns. This means that the Rainbow Lantern will have to accept a small deviation from its wishes, since the best that can be done is 178.5ns for the high time (15 cycles) and 35.7ns for the low time (3 cycles).
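At a fixed, slower clock, a wish has to be rounded to the nearest whole number of cycles. A sketch (the helper names are mine, not Marlin's):

```cpp
#include <cmath>
#include <cstdint>

// One CPU cycle at 84 MHz lasts about 11.9 ns.
constexpr double NS_PER_CYCLE_84MHZ = 1e9 / 84e6;

// Round a requested delay to the nearest whole cycle count.
inline int64_t cycles_for_ns(double delay_ns) {
    return std::llround(delay_ns / NS_PER_CYCLE_84MHZ);
}

// The delay actually obtained with that cycle count.
inline double achieved_ns(int64_t cycles) {
    return cycles * NS_PER_CYCLE_84MHZ;
}
```

182ns rounds to 15 cycles (~178.5ns achieved) and 34ns to 3 cycles (~35.7ns).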
Yet. yet...
The Big CPU Master does not know the wishes of the Rainbow Lantern in advance. When the Rainbow Lantern decides on a new wish, the Big CPU Master has to compute how many cycles to wait, and he has to interrupt what he was doing to fulfill the lovely lantern's dreams. This also takes some cycles.
In this universe, the Big CPU Master can ask little dwarves to compute the number of cycles for him, so he's not spending his precious time on the boring computations. Yet those pesky creatures refuse to deal with spoken values; they only accept values written in a book, else they go on strike (they are half French blood, half magical blood).
They say that a spoken value can change while the flux is streaming, while a written value never changes.
The Big CPU Master is reluctant to accept such a condition entirely, so he decides to accept the solution partially. If the value is written in a book, the little dwarves will precompute it. If it's not, he'll take some time to compute it himself. Since he knows how long such a calculation takes, he asks the Rainbow Lantern to never tell him a spoken wish shorter than his calculation time.
Yet. yet...
The Big CPU Master is getting old by now, and interrupting his work takes more time than in his infancy. He now has to close the book he was working on, open another book to count cycles, close it once he's done... and reopen the previous book. He wonders whether some mystical fairy could let him avoid this switching delay. It turns out that Mz Inline Intrinsics knows an old magical spell that does exactly that.
Back to our universe...
We want a solution that'll:
In language-layer terms, we would like the compiler to emit this (pseudo-assembly):
main:
    [previous_code()]
    nop
    nop
    nop
    nop
    nop
    [some_other_thing()]
    call cycle_wait(210)
from source code like this:
int main() {
    previous_code();
    DELAY_CYCLES(5); // Or DELAY_NS(65)
    some_other_thing();
    DELAY_CYCLES(210);
}
Because Marlin is written in C++, our little dwarf is sleeping in the compiler, waiting to perform computations at compile time.
When the code calls a DELAY function, there are two possibilities: either the argument is known at compile time, or it's a variable that will only be known at runtime.
In the latter case, we'll jump to a function that simply counts cycles. Ideally, we should know how many cycles are wasted jumping to and back from this function, and subtract that from the function's count.
In the former case, we need our little dwarf to compute the number of cycles to wait and to grudgingly insert as many sleepy cycles as he found into the current instruction flow. We don't want him to call another function, since that would add unwanted cycles to get there and come back. Yet we still have to monitor him, because sometimes he's dumb and we don't want him to insert too many sleepy cycles. Instead, we'll ask him to replace all those cycles with a call to a loop function whenever the count of sleepy cycles is greater than the cost of calling the loop function.
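The decision rule just described can be sketched like this (a minimal sketch; the 40-cycle overhead is an assumed value, measured per platform in practice):

```cpp
// Assumed cost (in cycles) of jumping to and back from the loop
// function; in reality this differs for every platform.
constexpr int LOOP_CALL_OVERHEAD = 40;

enum class Strategy { InlineNops, CallLoop };

// Emit inline sleepy cycles only while that is no more expensive
// than calling the runtime loop function.
constexpr Strategy pick_strategy(int cycles) {
    return cycles > LOOP_CALL_OVERHEAD ? Strategy::CallLoop
                                       : Strategy::InlineNops;
}
```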
Wait...
You said that each platform is different. How do you deal with such differences?
Calibration is the word. Upon boot, the system uses an (imprecise) timer to measure a delay with a large cycle count. Once this is done, figuring out the platform's oddities is a mathematical problem whose result is stored in a static variable. Let the printer do the math for us, so we don't have to maintain a database of values that might change with compiler version or optimization level.
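The boot-time math can be sketched as follows (a simplified model, not Marlin's actual calibration code): time a delay of a known nominal cycle count with the coarse timer, keep the ratio, and correct future requests with it.

```cpp
#include <cstdint>

// Ratio between cycles actually spent and cycles nominally requested,
// capturing the platform's "oddities" (fetch penalties, etc.).
struct Calibration {
    double cycles_per_nominal_cycle;
};

inline Calibration calibrate(uint64_t measured_cycles, uint64_t nominal_cycles) {
    return { static_cast<double>(measured_cycles) /
             static_cast<double>(nominal_cycles) };
}

// Shrink (or grow) a requested count so the real delay matches the intent.
inline uint64_t corrected(uint64_t requested, const Calibration& c) {
    return static_cast<uint64_t>(requested / c.cycles_per_nominal_cycle + 0.5);
}
```

For example, if a nominal 10000-cycle delay is measured at 12000 cycles, a request for 120 cycles is corrected down to 100.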
Umphhh... How do you write a "sleepy cycle" in C++ based on a compile-time variable?
There is a low-quality trick and a high-quality trick. The low-quality trick is to have a bunch of switch cases like this:
switch (cycle_count) {
    case 10: sleepy_cycle(); // No break here: falls through
    case 9:  sleepy_cycle();
    ...
    case 1:  sleepy_cycle();
    case 0:  break;
}
If the compiler isn't too stupid, it'll generate an array of sleepy_cycle instructions and jump into the middle of that array based on the requested cycle_count. It does.
Yet this solution is not very fun, because you must write as many switch cases as you expect the loop function's overhead to take (on my printer it's ~40 cycles, but it's different on every printer). Say you get a shiny new printer with a 4 THz CPU that needs 400 cycles to call a function? Right, you need to hand-update your code or it'll break again. In addition, this is not a function-level construction; it must be written inside a function, so you'll still have the overhead of calling that function.
The high-quality trick, however, is much better and less verbose:
template <int N> struct SleepCycleWriter
{
    inline static void sleep() { sleepy_cycle(); SleepCycleWriter<N - 1>::sleep(); }
};

// End the recursion with an empty function
template <> struct SleepCycleWriter<0> { inline static void sleep() {} };
Here, any time you instantiate SleepCycleWriter<N>::sleep(), the compiler recursively emits N sleepy_cycle() calls.
You need 400 cycles? No problem.
No function-call overhead whatsoever, since all the sleep() functions are inline.
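To convince yourself the recursion really expands N times, you can swap sleepy_cycle() for a counter (a test harness of mine; the real sleepy_cycle would emit a NOP):

```cpp
// Test harness: count expansions instead of burning cycles.
static int g_count = 0;
inline void sleepy_cycle() { ++g_count; }  // stand-in for a NOP

template <int N> struct SleepCycleWriter
{
    inline static void sleep() { sleepy_cycle(); SleepCycleWriter<N - 1>::sleep(); }
};
template <> struct SleepCycleWriter<0> { inline static void sleep() {} };
```

SleepCycleWriter<400>::sleep() compiles fine and bumps the counter exactly 400 times; no hand-written switch cases needed.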
OK, everything seems fine now. But how do you know whether a variable is compile-time or runtime? How do you act differently in each case?
In C++ (the 2014 version that Marlin uses), it's not possible to react differently at compile time versus runtime... unless you use a bunch of non-standard tricks. This is because a compile-time variable can be used as a template parameter, but a runtime variable cannot. This is invalid C++:
int x = 10; SleepCycleWriter<x>::sleep(); // error: x is not a constant expression
So, short of doing an ugly:
switch (cycle_count) { case 10: SleepCycleWriter<10>::sleep(); break; ... }
there is no standard optimal solution (not even in C++20... yet).
To solve this issue, I'm relying on a compiler extension called __builtin_constant_p. This magical symbol checks whether the given parameter is a compile-time value (a constexpr in C++ guru terms) but does not fail if it's not (unlike if constexpr in C++17 and later).
It also provides a nice advantage when used like this:
constexpr int o = __builtin_constant_p(count) ? count : 0;
In that specific case, I'm building a constant expression (a compile-time value) from a possibly-runtime value: it's the given count if count is a compile-time value, or the constant 0 otherwise.
The magic is in the ? count part here: if count is not a compile-time value, that operand is not evaluated and does not generate a compile-time error (unlike an if constexpr statement).
After that, everything is pretty straightforward: o is a compile-time value that can be used as a template parameter. We just need to specialize some template on this parameter so that when it's 0 (or -1 or whatever, we just need to be consistent), we fall back to a runtime call of the loop function.
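Putting the pieces together, here is a dispatch sketch (GCC/Clang only, since it uses __builtin_constant_p and a statement expression; the names and returned values are illustrative stand-ins, not Marlin's actual macro):

```cpp
static int g_runtime_calls = 0;

// Runtime fallback: stand-in for the cycle-counting loop function.
inline int runtime_wait(int cycles) { ++g_runtime_calls; return cycles; }

// Compile-time path: stand-in for the template emitting N sleepy cycles.
template <int N> struct InlineWait { inline static int wait() { return N; } };

// _o is 'count' when count is a compile-time constant, else 0; the
// __builtin_constant_p pattern keeps the initializer a constant expression.
#define WAIT_CYCLES(count) __extension__ ({                           \
    constexpr int _o = __builtin_constant_p(count) ? (count) : 0;     \
    __builtin_constant_p(count) ? InlineWait<_o>::wait()              \
                                : runtime_wait(count);                \
})
```

WAIT_CYCLES(10) resolves to the template path at compile time; WAIT_CYCLES(v) with a runtime v falls through to runtime_wait().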
Well, everything boils down to results anyway. Why bother with all this mess if it doesn't give excellent results?
In fact, it does:
Calling the DELAY_CYCLES macro directly for 1cycles took: 1cycles
Calling the DELAY_CYCLES macro directly for 5cycles took: 5cycles
Calling the DELAY_CYCLES macro directly for 10cycles took: 11cycles
Calling the DELAY_CYCLES macro directly for 20cycles took: 20cycles
Calling the DELAY_CYCLES macro directly for 50cycles took: 49cycles
Calling the DELAY_CYCLES macro directly for 100cycles took: 102cycles
Calling the DELAY_CYCLES macro directly for 200cycles took: 207cycles
Calling the DELAY_CYCLES macro directly for 500cycles took: 497cycles
Only a cycle or so of error for small delays, and at most 7 cycles of error for large ones.
You can see for yourself in this godbolt