axle’s red rectangle of doom

Over the course of developing an operating system, things are going to crash, and they’re going to crash a lot.

One way to make crashes slightly less annoying is to isolate them: the rest of the system continues running, and you can poke around further after an unexpected condition arises in one process. Perhaps surprisingly, it took me several years before this became feasible!

Previously, when any code path, in any process, encountered an error condition (such as a page fault, or an explicit assert()), the kernel would spew some debug information over the serial port, then lock up to prevent further shenanigans.

This is pretty overzealous and often unnecessary. Even for exceptions triggered in kernel code, it’s often perfectly acceptable for just the responsible process to be terminated, and to allow the rest of the system to continue on its way.

In late 2021, I added a crash reporter application that reports failures in most processes. The mechanism works for both CPU-level faults (which trap to kernel-space) and voluntary aborts (which technically trap too, since they happen via a syscall).

Note
More correctly, faults fault to kernel space. Once the kernel’s fault handler completes, a fault returns execution to the faulting instruction, whereas a trap proceeds from the instruction following the one that caused it. The former is useful, for example, if the kernel page faults because some memory is backed by lazily loaded data: after loading the data, the instruction that tried to access the memory should run again. The int instruction, which is the mechanism used by syscalls, triggers a software-initiated trap, and execution continues after the int instruction (i.e. after the syscall returns).
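To make the distinction concrete, here is a tiny model of where execution resumes in each case. This is only an illustrative sketch: on real hardware the CPU pushes the return address itself, and the names exception_kind_t and resume_address are made up for this example rather than taken from axle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of an exception, for illustration only */
typedef enum {
    EXCEPTION_FAULT, /* e.g. a page fault: retry the faulting instruction */
    EXCEPTION_TRAP   /* e.g. an int-based syscall: continue after it */
} exception_kind_t;

/*
 * Model of the address at which execution resumes once the handler returns.
 * instruction_addr is the address of the instruction that raised the
 * exception, and instruction_len is its encoded length in bytes.
 */
uint64_t resume_address(exception_kind_t kind,
                        uint64_t instruction_addr,
                        uint8_t instruction_len) {
    assert(instruction_len > 0);
    if (kind == EXCEPTION_FAULT) {
        /* Faults re-run the same instruction */
        return instruction_addr;
    }
    /* Traps skip past the instruction that caused them */
    return instruction_addr + instruction_len;
}
```

For a two-byte int instruction at 0x1000, the trap case resumes at 0x1002, while a page fault on the same address would resume at 0x1000 again.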

Not all failures can be routed to the crash reporter, though. Three services are vital to the crash reporting flow. If any of them crash, the system must lock up.

  • com.axle.crash_reporter
    • If the crash reporter itself crashes, we definitely don’t have recourse for displaying a crash report.
  • com.axle.awm (axle window manager)
    • We need the window manager to display the crash reporter’s window.
  • com.axle.file_server
    • The file server is required to launch the crash reporter.

If one of these critical services dies, you get the axle red rectangle of doom (patent pending).

Crash reporting

Each kernel code path that handles an implicit task failure, such as the page fault handler, ends in a call to task_assert(), which includes a user-facing description of the error.

The syscall for a voluntary abort, called by libc::assert(), is just another wrapper over task_assert().

The machinery will then decide whether the crashing process is important enough to warrant the fuss of a full lockup, or if we can show the interactive crash reporter instead.
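That decision could look roughly like the sketch below. The three service names come from the list above, but everything else (the crash_action_t type, the function name, and the convention that an empty string means "no service") is an assumption for the sake of the example, not axle's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Services the crash reporting flow depends on (names from the post) */
static const char* const CRITICAL_SERVICES[] = {
    "com.axle.crash_reporter",
    "com.axle.awm",
    "com.axle.file_server",
};

/* Hypothetical outcome type, for illustration only */
typedef enum {
    CRASH_ACTION_LOCK_UP,       /* red rectangle of doom */
    CRASH_ACTION_SHOW_REPORTER  /* interactive crash reporter */
} crash_action_t;

/*
 * Decide what to do when the process hosting service_name dies.
 * Assumption: callers pass "" for a process that hosts no service.
 */
crash_action_t crash_action_for_service(const char* service_name) {
    assert(service_name != NULL);
    size_t count = sizeof(CRITICAL_SERVICES) / sizeof(CRITICAL_SERVICES[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(service_name, CRITICAL_SERVICES[i]) == 0) {
            /* Without this service we can't show a crash report at all */
            return CRASH_ACTION_LOCK_UP;
        }
    }
    return CRASH_ACTION_SHOW_REPORTER;
}
```

A flat table keeps the policy easy to audit: adding a new dependency of the crash reporting flow is a one-line change.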

If we’re able to use the interactive crash reporter, we first ask the file server to launch it if it’s not already running. We then build up a crash report consisting of the registers at the time the fault was raised.

Note
This is one neat benefit of sharing this code path with voluntary syscalls to assert(). Since syscalls are built upon traps (another type of interrupt), the interrupt handling mechanism captures register state just as though this code had been invoked through a fault handler.


The crash reporter needs the dead process to host an amc service to do its work, so if the process isn’t hosting one at the time of death, the crash reporter will anoint it with an auto-generated service name like com.axle.corpse_service_PID_{id}_time_{ms}, as a sort of quiet eulogy.
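Generating a name in that shape is a one-liner with snprintf. The format string follows the pattern quoted above; the function name, the parameter types, and the choice to return -1 on truncation are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write an auto-generated service name for a dead process into buf.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if buf is too small to hold the whole name.
 */
int corpse_service_name(char* buf, size_t buf_len,
                        uint32_t pid, uint32_t time_ms) {
    assert(buf != NULL && buf_len > 0);
    int written = snprintf(buf, buf_len,
                           "com.axle.corpse_service_PID_%u_time_%u",
                           (unsigned)pid, (unsigned)time_ms);
    if (written < 0 || (size_t)written >= buf_len) {
        /* Encoding error, or the name would have been truncated */
        return -1;
    }
    return written;
}
```

Including both the PID and a timestamp keeps names unique even if a PID is later reused by another doomed process.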

We then symbolicate the return addresses we see on the stack. Each time the kernel launches an ELF, the kernel stores its symbol table. The kernel also stores its own symbol table, so that we can symbolicate both kernel and user-binary symbols when reporting crashes.
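Symbolication boils down to finding the symbol whose start address is the closest one at or below a given return address. Here is a minimal sketch using a binary search over a sorted table. The symbol_t layout and the function name are hypothetical, and unlike real ELF symbols this ignores symbol sizes, so an address past the end of the last function still resolves to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory symbol entry, for illustration only */
typedef struct {
    uint64_t start;   /* address of the symbol's first byte */
    const char* name; /* symbol name from the ELF symbol table */
} symbol_t;

/*
 * Find the symbol containing addr.
 * table must be sorted by ascending start address.
 * Returns the entry with the greatest start <= addr, or NULL if addr
 * lies below every symbol in the table.
 */
const symbol_t* symbolicate(const symbol_t* table, size_t count,
                            uint64_t addr) {
    assert(table != NULL || count == 0);
    const symbol_t* best = NULL;
    size_t lo = 0;
    size_t hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].start <= addr) {
            /* Candidate match; a later symbol might be closer */
            best = &table[mid];
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return best;
}
```

With one table for the kernel and one per launched ELF, the reporter can try the user binary's table first and fall back to the kernel's for addresses in kernel space.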

To complete the kernel’s work, the crash report is sent to the crash reporter application.

The crash reporter itself is pretty straightforward: it listens for crash reports, and renders them to a text view (in green text, since we’re 😎hacking😎). Like all services, it defines structured message types that the kernel needs to conform to.


