One thing I've been unhappy about with my ports of Inferno to Thumb-2-based microcontrollers is the lack of floating point (FP) arithmetic support so far. This isn't because I want to do mathematics or perform complex calculations quickly. It's because a lot of Limbo modules use the real type to represent numbers just because it's more natural for many calculations.
When doing anything related to Inferno ports these days, the first port of call is the Raspberry Pi port and the series of labs that document how it was done. The most relevant one for us in this case is Lab 26, floating point, which makes it sound almost trivial to implement support for floating point instructions. Unfortunately, the instruction set used by the original Raspberry Pi is not the same as the one we're using for the ports to Thumb-based microcontrollers. That's why we're using the tl linker instead of 5l, for example. So, I anticipated some problems ahead of time, just not necessarily all the ones I encountered.
The other required background reading is the ARMv7 Architecture Reference Manual. This shows the instruction encodings for the floating point instructions that Thumb-2 microcontrollers can support.
Most, if not all, of the Cortex-M4 microcontrollers available at the moment only support single precision floating point arithmetic. This means that not only are they limited in the range of operations they can perform, they also do not recognise many instructions that handle double precision numbers. The instructions that load and store doubles to and from FP registers are implemented, but the ones that do double precision arithmetic aren't recognised.
If you feed one of these unrecognised instructions to the microcontroller, it raises an exception which you need to trap. Typically, what you get is a Usage Fault that reports an Undefined Instruction. When this happens, you need to find where the exception occurred, check what the instruction was at that address by decoding it, then emulate it if it's one that you need to support.
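As a rough sketch of the first steps described above, the following C fragment checks whether a fault status value reports an undefined instruction and reads the faulting instruction so that it can be decoded. It assumes the fault status comes from the Configurable Fault Status Register (at address 0xE000ED28 on ARMv7-M) and that the faulting address comes from the stacked program counter. The helper names are made up for illustration and are not taken from the port, and the FP check is only a coarse filter, not a full decoder.

```c
#include <stdint.h>

/* UNDEFINSTR is bit 0 of the Usage Fault Status Register, which occupies
   the upper half-word of the CFSR, so it appears as bit 16 of the CFSR. */
#define CFSR_UNDEFINSTR (1u << 16)

/* Hypothetical helper: does this CFSR value report an undefined instruction? */
static int is_undefined_instruction(uint32_t cfsr)
{
    return (cfsr & CFSR_UNDEFINSTR) != 0;
}

/* Hypothetical helper: read a 32-bit Thumb-2 instruction as two half-words,
   placing the first half-word in the upper 16 bits of the result. */
static uint32_t fetch_thumb2(const uint16_t *pc)
{
    return ((uint32_t)pc[0] << 16) | pc[1];
}

/* Coarse filter (an assumption, not a complete decoder): FP instructions
   start with the bits 1110 and have 101 in bits 11 to 9. */
static int looks_like_fp(uint32_t ins)
{
    return (ins >> 28) == 0xe && ((ins >> 9) & 7) == 5;
}
```

A handler built along these lines would only try to emulate an instruction when both checks pass, and would otherwise report the fault as usual.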
The Raspberry Pi port uses an existing implementation of an FP instruction emulator called fpiarm.c, which relies on code in the fpi.c and fpimem.c files in the os/port directory. This does some undefined instruction decoding and emulates floating point instructions using integer arithmetic, using a data structure called Internal that represents the different fields of a floating point number.
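To give an idea of what a structure representing the fields of a floating point number involves, here is a minimal sketch that splits an IEEE 754 double into its sign, exponent and fraction, and puts it back together. The Fields type and both functions are hypothetical stand-ins written for this example. They are not the Internal structure or the routines from fpi.c and fpimem.c, and they do not treat zeros, denormals, infinities or NaNs specially.

```c
#include <stdint.h>
#include <string.h>

#define FRAC_MASK ((1ULL << 52) - 1)

/* Hypothetical stand-in for an emulator's internal representation. */
typedef struct {
    int sign;       /* 0 for positive, 1 for negative */
    int exponent;   /* exponent with the bias of 1023 removed */
    uint64_t frac;  /* 52-bit fraction, without the implicit leading 1 */
} Fields;

/* Split a double into its three fields. */
static Fields unpack_double(double d)
{
    uint64_t bits;
    Fields f;

    memcpy(&bits, &d, sizeof bits);
    f.sign = (int)(bits >> 63);
    f.exponent = (int)((bits >> 52) & 0x7ff) - 1023;
    f.frac = bits & FRAC_MASK;
    return f;
}

/* Reassemble a double from its three fields. */
static double pack_double(Fields f)
{
    uint64_t bits;
    double d;

    bits = ((uint64_t)f.sign << 63)
         | ((uint64_t)(f.exponent + 1023) << 52)
         | (f.frac & FRAC_MASK);
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Once a value is in this form, arithmetic can be carried out on the fields using integer operations only, which is the general approach an emulator of this kind takes.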
For Thumb-2, we want to use the native FP instructions as much as possible, so we can't really maintain a separate set of per-process, emulated registers as the fpiarm.c implementation does. Instead, we want to operate on the FP registers themselves when handling an undefined FP instruction. This doesn't mean that the functions in fpimem.c to handle Internal values are redundant, however. They are still needed to translate between 32-bit integers, doubles and the internal representation used to perform calculations.
Another issue that complicates matters is the representation of floating point numbers in memory. The predefined endianness for Thumb in the Inferno/thumb/include/lib9.h file indicated big-endian FP doubles, but experimentation indicates that Thumb-2 uses little-endian doubles. This mismatch caused problems for code that accesses doubles in memory, such as libmath/dtoa.c and other places in libmath, such as libmath/fdlibm/fdlibm.h. The compiler also needs to know the correct endianness of doubles.
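The kind of experiment that reveals the word order of doubles can be sketched in a few lines of C. The double 1.0 has the value 0x3ff00000 in its high word and zero in its low word, so checking which 32-bit word in memory holds 0x3ff00000 shows which order is in use. The function names here are invented for the example.

```c
#include <stdint.h>
#include <string.h>

/* Return the index (0 or 1) of the 32-bit word that holds the sign and
   exponent of a double in memory: 1 for little-endian doubles, 0 for
   big-endian doubles. */
static int double_high_word_index(void)
{
    double one = 1.0;
    uint32_t w[2];

    memcpy(w, &one, sizeof w);
    return w[0] == 0x3ff00000u ? 0 : 1;
}

/* Return the high word of a double using the detected word order. */
static uint32_t high_word(double d)
{
    uint32_t w[2];

    memcpy(w, &d, sizeof w);
    return w[double_high_word_index()];
}
```

Code that picks out the high and low words of a double with fixed indices gives wrong results if the configured word order does not match what this check reports on the target.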
I've previously copied code from 5l into tl, as well as adding my own code, to handle FP instructions. The annoying thing about Thumb-2 FP instructions is that they are mostly the same as regular 32-bit ARM instructions, except that the order of the upper and lower 16 bits is swapped. This led me to specify all the instructions in a different way to those in 5l, which doesn't help when comparing the code in the two tools. Also, the 32-bit ARM instructions include conditional execution fields that are redundant in Thumb-2 code, where the FP instructions themselves are encoded as unconditional.
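The relationship between the two encodings can be illustrated with a small helper that takes a 32-bit ARM FP instruction, replaces its condition field with the "always" pattern and swaps the two half-words. This is only a sketch of the idea, not how tl specifies its instructions, and the function name is invented. The value 0xEE310B02 used in the note below is what I believe to be the ARM encoding of a double precision add of D1 and D2 into D0; treat it as an assumption.

```c
#include <stdint.h>

/* Convert a 32-bit ARM FP instruction into the word that a little-endian
   Thumb-2 instruction stream would contain: force the top four bits to
   1110, then swap the upper and lower half-words. */
static uint32_t arm_to_thumb2_word(uint32_t arm)
{
    uint32_t ins = (arm & 0x0fffffffu) | 0xe0000000u;

    return (ins << 16) | (ins >> 16);
}
```

With this helper, both 0xEE310B02 and the conditional form 0x0E310B02 map to the same word, 0x0B02EE31.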
I was initially a bit confused by what tl understands by the floating point registers it refers to as F0 to F7. Very quickly, I realised that it operates on double-sized registers, D0 to D7 in ARM terminology. The way that single and double-sized register numbers are encoded in instructions is sort-of-intuitive in that the index of a double-sized register is implicitly encoded as the index of the first float-sized register in a pair of FP registers. For example, D2 is encoded as S4, and occupies S4 and S5.
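The register numbering can be made concrete with a sketch of how register indices are split into instruction fields. In the ARM encodings, a register is described by a 4-bit field plus one extra bit. For single-sized registers the extra bit is the lowest bit of the index, while for double-sized registers it is the highest bit. The type and function names below are invented for the example.

```c
#include <stdint.h>

/* The two parts of a register number as they appear in an instruction. */
typedef struct {
    unsigned v;      /* the 4-bit register field */
    unsigned extra;  /* the additional single bit */
} RegField;

/* Single-sized register Sn: the field holds n/2, the extra bit holds n%2. */
static RegField encode_sreg(unsigned s)
{
    RegField r = { (s >> 1) & 0xf, s & 1 };
    return r;
}

/* Double-sized register Dn: the field holds the low four bits of n and
   the extra bit holds the fifth bit, which is 0 for D0 to D15. */
static RegField encode_dreg(unsigned d)
{
    RegField r = { d & 0xf, (d >> 4) & 1 };
    return r;
}

/* Index of the first single-sized register that overlaps Dn. */
static unsigned first_sreg_of_dreg(unsigned d)
{
    return 2 * d;
}
```

Encoding D2 and S4 this way gives the same field value of 2 with an extra bit of 0, which is why D2 appears in an instruction as if it were S4.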
The way the compiler emits code means that a regular register is set aside for temporary values within the collection of instructions for each operation. I included a definition for a corresponding temporary FP register, though I don't think that tc needs or uses it, unlike tl which does.
One other problem that appeared was related to the use of FP registers by the compiler. This was observed when using the calc tool, and could be traced to the __ieee754_exp function in the libmath/fdlibm/e_exp.c file. Basically, it seems that the compiler uses an FP register to hold the contents of a variable but doesn't take into account that the variable is also modified via memory operations. The workaround was to use tc's -N option to prevent “registerization” for this code, keeping accesses consistent. I may write more about this in the future, depending on any insights others might have.
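As an illustration of the kind of code that mixes FP values with memory operations, fdlibm adjusts the exponent of a double by adding to its high word in memory instead of performing a multiplication. The sketch below shows that pattern using a union. It is not the code from e_exp.c, the names are invented, and it is only an assumption that a statement of this kind is what exposes the problem. If a compiler kept the variable in an FP register and missed the update made through memory, the returned value would be stale.

```c
#include <stdint.h>

/* A double viewed as two 32-bit words. */
typedef union {
    double d;
    uint32_t w[2];
} DoubleWords;

/* Index of the word holding the sign and exponent on this machine. */
static int high_index(void)
{
    DoubleWords probe;

    probe.d = 1.0;
    return probe.w[0] == 0x3ff00000u ? 0 : 1;
}

/* Multiply y by 2 to the power k by adding k to the exponent field,
   which starts at bit 20 of the high word. No overflow checks are made. */
static double scale_exponent(double y, int k)
{
    DoubleWords u;

    u.d = y;                                 /* value written as a double */
    u.w[high_index()] += (uint32_t)k << 20;  /* modified through memory */
    return u.d;                              /* and read back as a double */
}
```

For example, scaling 1.5 by a k of 3 should give 12.0.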
Update (2023-10-13): Richard Miller diagnosed the problem as a compiler bug and suggested a workaround. The discussion thread can be found here.
One other aspect of all this is how exceptions are handled in the first place. I already handle the SysTick exception to perform time slicing, which is quite an ordeal in itself. Handling Usage Fault exceptions is a little different because the fault needs to be cleared if the fault was handled successfully, but the emulated instructions also need to modify the values of registers on the stack.
Without floating point enabled, the microcontroller pushes a subset of the regular registers onto the stack when an exception occurs, and pops them off the stack if or when the exception returns. When FP is enabled, it also automatically stacks and unstacks FP registers: S0 to S15 (D0 to D7) and the FPSCR status register. This can be configured to be “lazy” but let's not get ahead of ourselves. Instead of storing FP registers in the FPenv struct for each process, we leave them on the stack. I don't think we have a choice about this: with FP enabled, this happens automatically. It might be possible to disable FP completely and trap all FP instructions, then either emulate all of them or enable FP selectively for those that the hardware can execute. It seems like a lot of work for something that could easily break if not done carefully.
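To show what operating on stacked registers looks like, here is a sketch of the frame that the hardware pushes when FP is enabled, as defined by the ARMv7-M architecture: eight core registers followed by S0 to S15, FPSCR and a reserved word. The struct and function names are invented for this example, and a real handler that pushes further registers itself, as mine does, will need a larger struct than this.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the hardware-stacked frame, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;  /* basic frame */
    uint32_t s[16];                               /* S0 to S15 */
    uint32_t fpscr;
    uint32_t reserved;
} ExtendedFrame;

/* Read stacked register Dn (n from 0 to 7) as a 64-bit pattern.
   The low word of Dn is S(2n) and the high word is S(2n+1). */
static uint64_t read_dreg(const ExtendedFrame *f, unsigned d)
{
    return ((uint64_t)f->s[2 * d + 1] << 32) | f->s[2 * d];
}

/* Write a 64-bit pattern to stacked register Dn, so that the new value
   is loaded into the real register when the exception returns. */
static void write_dreg(ExtendedFrame *f, unsigned d, uint64_t v)
{
    f->s[2 * d] = (uint32_t)v;
    f->s[2 * d + 1] = (uint32_t)(v >> 32);
}

/* Step over an emulated 32-bit instruction before returning. */
static void skip_instruction(ExtendedFrame *f)
{
    f->pc += 4;
}
```

An emulated instruction would read its operands with read_dreg, compute the result in software, store it with write_dreg and then advance the stacked program counter so that execution resumes after the instruction.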
The Thumb-2 instructions I use to push and pop registers on and off the stack are documented as being unpredictable in certain cases. This wasn't something I'd noticed, even though I was using one of the cases that could be problematic. However, I did find another case where the stacked registers were not as they should be, and I can't explain why it occurs. Adding another register to the list, and making room for it in the struct used to reference stacked registers, seems to have made the problem go away.
So far there has been a lot of work just to get started, and certainly more than I expected. Some level of floating point support is something that is expected to just work, so it's always disappointing to find things that are incorrect. In any case, we can try to run a few things and look for errors:
apollo3$ calc
solve(x**2-5*x+6==y**3+z, x)
2 3 3
differential(x=1, x*x+5*x+6)
7
integral(x = -1.96, 1.96, exp(-0.5*x*x)/sqrt(2*Pi))
.9500042096998785
Performance doesn't appear to be great, but we don't have a JIT compiler to fall back on. Running calc the first time can cause the main memory pool to grow enough to prevent it from being run again. This doesn't appear to be a recurring memory leak: if it can be run again, it doesn't cause the pool to grow any further.
Apart from the register management issue with tc, the next steps will be to test more FP-dependent programs, add some tests to a test program I'm using, and see if the compiler can be optimised more generally to reduce the number of NOOPs it produces.
The work-in-progress floating point handling code can be found in the apollo3-fp branch of this repository. Look in the os/apollo3/fpithumb2.c file for a sample of some of the code involved.