Screwing up my page tables

Reading time: 20 minutes



Note
This post was discussed further on Hacker News.

Recently, the epic yak shave that is my life dragged me down into oblong pits of page table corruption. I’ve made it out the other side, with a primary takeaway of “how the hell did this ever work?”, followed closely by “what the hell is still hiding?”.

Picture this: you’re carefree, healthy, “hey, I should rewrite axle’s network stack in Rust!”. Sinister snap of the violin. Sad oboe, crying clown.

OK, this can’t be too bad. Let’s start with the basics: we’ll fire up the network card driver to jog our memory about where we left things off.

Right off the bat, we can see the pointer fields are expecting a 32-bit address space. I clearly never updated this component after upgrading axle to support x86_64.
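The shape of the problem, roughly (this isn’t the driver’s real layout, just an illustration of what I found):

    #include <stdint.h>

    // Physical addresses squeezed into 32-bit fields: fine on i686, silently
    // truncated on x86_64.
    typedef struct nic_state {
        uint32_t receive_buffer_addr;   // physical address of the RX buffer
        uint32_t transmit_buffer_addr;  // physical address of the TX buffer
        uint32_t rx_write_offset;
    } nic_state_t;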

Hopefully this component won’t just silently run into bugs on x86_64, though? Let’s keep looking.

Ah ha! Thanks past-me. This driver wants to call amc_physical_memory_region_create(), but I left myself a quick note that this code path needs another look. amc_physical_memory_region_create() allows the caller to allocate a region of physically contiguous memory, and I can see that I removed this capability from axle’s physical memory allocator at some point in the past. I inserted assertions so that any code path that tries to allocate a physically contiguous region will alert me, upon execution, that the time has come for more physical memory allocator work.

Well, the time has come for more physical memory allocator work. The current design, implemented in C, just maintains a stack of free physical frames. It’s a bit difficult to identify and allocate contiguous regions with just a stack for storage, and adjusting our design will make things a lot easier for us. Let’s reserve a region for contiguous allocations upon startup, and allocate from it separately.

However, I can’t be bothered to do it in C.

Rust to the ruscue

Following my new SOP, I decided to rewrite my kernel-module-du-jour in Rust, and make further improvements on the module from there.

No worries, this module should be pretty simple! Let’s get started with just a one-to-one replacement of the C API. Then, we can add our contiguous chunk pool once we’re sure the Rust-based module is working as intended.

The interface to the physical memory allocator is pretty straightforward, as is its work:

  • Upon boot in pmm_init(), read the UEFI-provided memory map to discover physical memory regions.
  • For every region marked as ‘usable’ in the memory map, track its frames in the allocator’s free-frames stack, skipping the specific frames we intentionally withhold from the allocator.
  • When an allocation is requested via pmm_alloc(), pop a free frame from the free-frames stack.
  • When a frame is freed via pmm_free(), push the frame back onto the free-frames stack.
Note
axle occasionally needs to use specific and fixed physical frames, such as while booting cores during SMP bringup, so these frames need to be ignored by the allocator.
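A minimal Rust sketch of that flow (illustrative only: the real module is no_std, parses the actual UEFI memory map, and stores frames in a fixed-capacity collection rather than a Vec):

    const PAGE_SIZE: u64 = 0x1000;

    struct PhysicalMemoryManager {
        free_frames: Vec<u64>,
    }

    impl PhysicalMemoryManager {
        // pmm_init(): walk the usable regions and track every frame we're allowed to hand out
        fn new(usable_regions: &[(u64, u64)], reserved_frames: &[u64]) -> Self {
            let mut free_frames = Vec::new();
            for &(base, size) in usable_regions {
                let mut frame = base;
                while frame + PAGE_SIZE <= base + size {
                    if !reserved_frames.contains(&frame) {
                        free_frames.push(frame);
                    }
                    frame += PAGE_SIZE;
                }
            }
            Self { free_frames }
        }

        // pmm_alloc(): pop a free frame off the stack
        fn alloc(&mut self) -> Option<u64> {
            self.free_frames.pop()
        }

        // pmm_free(): push the frame back onto the stack
        fn free(&mut self, frame: u64) {
            self.free_frames.push(frame);
        }
    }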

This can be implemented in well under a couple of hundred lines, and there’s nothing snazzy about it. Imagine my horror, then, when I turned on the new Rust-based allocator and started getting page faults during boot!

Bug #1: Buffer overflow in SMP bootstrap setup

Debugging memory bugs in kernel-space is never fun, and at the worst of times can be an outright nightmare. Timing and different sequencing of events during testing sometimes preclude bugs from manifesting, making it difficult to isolate whatever is going wrong.

We know that the bug doesn’t manifest when using the C allocator, and does when using the Rust allocator. Maybe these allocators have slightly different allocation patterns?

That’s interesting! The C allocator operates in LIFO order: it pushes freed frames onto the top of its stack, such that the next allocation yields whatever frame was most recently freed. The Rust allocator, by contrast, operates in FIFO order: freed frames are enqueued at the back, and we’ll only see them again once we exhaust all the frames we enqueued at the beginning.

Note
Last in, first out and first in, first out.

Although noteworthy, the difference in behavior isn’t surprising: in the C allocator, I’m using a small handwritten stack, which naturally has LIFO semantics. In the Rust allocator, I’m using heapless::Queue, which is FIFO by design. The difference in access pattern isn’t a problem in and of itself, but the fact that different frames are now being used for early-boot structures is causing some deeper bug to crawl out of the shadowy depths.
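Here’s the difference in miniature, using std collections in place of the kernel’s actual types:

    use std::collections::VecDeque;

    // Free the same three frames into each allocator, then allocate once from each.
    fn main() {
        // C allocator: a stack, so the most recently freed frame comes back first (LIFO).
        let mut free_stack = vec![0x1000u64, 0x2000, 0x3000];
        assert_eq!(free_stack.pop(), Some(0x3000));

        // Rust allocator: a queue (heapless::Queue in axle), so freed frames join the
        // back and the oldest enqueued frame is handed out first (FIFO).
        let mut free_queue: VecDeque<u64> = VecDeque::from([0x1000, 0x2000, 0x3000]);
        assert_eq!(free_queue.pop_front(), Some(0x1000));
    }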

I started hacking at the codebase, lopping off chunks of functionality left and right, haphazardly disabling subsystems within which the bug could nest. Eventually, I managed to isolate a reproducing case to within SMP bringup.

As a refresher, each time we boot another core on the machine during symmetric multiprocessing bringup, the core will invariably start booting in 16-bit mode, and we’ll need to manually help it go through the motions of enabling 32-bit and 64-bit mode (or ‘long mode’). Booting into 16-bit mode imposes some constraints: in particular, we need to have some code at a known physical memory address below the 1MB mark that the core can begin executing from. In axle, frame 0x8000 is used for this purpose, and is referred to as the SMP bootstrap code frame. axle places a hand-coded assembly routine into this frame that performs the long walk to long mode.

Note
Detailed further in Running axle on multiple CPUs.

To make the transition to 64-bit mode more convenient, several of the data structures that this program needs are placed into another special frame, 0x9000, which axle denotes the SMP bootstrap data frame. Specific offsets within this frame hold pointers to different data structures that the bootstrap program needs while transitioning to 64-bit mode, such as:

  • A 32-bit GDT
  • A 64-bit GDT containing ‘low memory’ pointers
  • A 64-bit GDT containing ‘high memory’ pointers
  • A 64-bit PML4
  • A 64-bit IDT
  • A high-memory stack pointer
  • A high-memory entry point to jump to once the bootstrap program completes

While setting up the SMP bootstrap data frame, the kernel will create these structures, and copy the pointers to their expected locations within the data frame. For example, here’s how the kernel sets up the C entry point address:
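In rough strokes, with hypothetical symbol names and offsets:

    #include <stdint.h>
    #include <string.h>

    #define SMP_BOOTSTRAP_DATA_FRAME        0x9000ULL
    #define BOOTSTRAP_DATA_ENTRY_POINT_OFF  0x18    // hypothetical offset

    void ap_c_entry(void);

    void smp_setup_entry_point(void) {
        // Copy the 64-bit address of the high-memory C entry point into its
        // well-known slot within the bootstrap data frame.
        uint64_t entry_point = (uint64_t)&ap_c_entry;
        memcpy((void*)(SMP_BOOTSTRAP_DATA_FRAME + BOOTSTRAP_DATA_ENTRY_POINT_OFF),
               &entry_point,
               sizeof(entry_point));
    }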

Here’s how the kernel sets up the IDT address:
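In the same style, continuing the hypothetical names above (the size argument is the interesting part):

    // idt_pointer_t is the usual IDT descriptor: a 16-bit size plus a 64-bit base.
    typedef struct idt_pointer {
        uint16_t table_size;
        uint64_t table_base;
    } __attribute__((packed)) idt_pointer_t;

    #define BOOTSTRAP_DATA_IDT_OFF  0x20    // hypothetical offset

    void smp_setup_idt(idt_pointer_t* current_idt) {
        memcpy((void*)(SMP_BOOTSTRAP_DATA_FRAME + BOOTSTRAP_DATA_IDT_OFF),
               current_idt,
               sizeof(idt_pointer_t) + current_idt->table_size);
    }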

Wait, what? We’re copying sizeof(idt_pointer_t) + current_idt->table_size bytes, but no one was expecting us to copy the whole table! We’re only meant to be copying a pointer to the table.

Double-checking the bootstrap program, we definitely didn’t expect this; the bootstrap program just expects a pointer, not the full table data.

Having a closer look at current_idt->table_size, we find that this table occupies 0xfff bytes. This means that our flow above will write outside the allowed frame, which ends at 0xa000, overwriting whatever’s occupying 0xa000 to 0xa408 with the back half of the kernel’s interrupt descriptor table. This overflow didn’t cause any visible errors with our previous frame allocation pattern, so it didn’t come up before. However, with our new allocation pattern, this overflow is overwriting data in a way that leads directly to a page fault. Let’s fix the overflow:
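Sticking with the hypothetical names from the sketch above, the copy shrinks to just the descriptor:

    void smp_setup_idt(idt_pointer_t* current_idt) {
        // Copy only the descriptor itself; the bootstrap program just needs the
        // base-and-limit, not the 0xfff-byte table behind it.
        memcpy((void*)(SMP_BOOTSTRAP_DATA_FRAME + BOOTSTRAP_DATA_IDT_OFF),
               current_idt,
               sizeof(idt_pointer_t));
    }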

Lesson

Changing our frame allocation strategy, an access pattern detail that ideally shouldn’t drastically impact system behavior, caused several low-level memory bugs to come to light. Maybe we could intentionally randomize our frame allocation patterns, so we can actively search out bugs of this class. We could also allow setting a seed for this randomization, to enable reproducible runs.
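As a sketch of the idea (axle doesn’t do this yet), a seeded shuffle over the free-frame list at boot would be enough:

    // Shuffle the free-frame list with a tiny deterministic PRNG so each run
    // exercises a different allocation order, while a fixed seed stays reproducible.
    fn shuffle_frames(frames: &mut [u64], seed: u64) {
        let mut state = seed.max(1); // xorshift must not start at zero
        let mut next = || {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state
        };
        // Fisher-Yates shuffle driven by the seeded PRNG
        for i in (1..frames.len()).rev() {
            let j = (next() % (i as u64 + 1)) as usize;
            frames.swap(i, j);
        }
    }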

One bug squashed, let’s keep going!

Bug #2: Overwriting paging structures due to a truncated pointer

We’ve progressed a bit further into boot, but now we see something really odd. At some point, we get a page fault. However, the faulting address points to memory that should obviously be valid, such as bits of the kernel that are mapped into every address space. Furthermore, this page fault is much more resistant to the slice-things-off-until-only-the-bug-remains approach that we used above. The exact location of the page fault seems to jump around between different runs. We’re going to have to solve this quickly, as if I bite my fingernails any shorter I’ll lose my elbows.

I fire up gdb and get to work tracking down what’s going wrong. Eventually, I set a breakpoint in _handle_page_fault() and investigate the virtual address space state.

Here’s what a normal virtual address space layout looks like for an early-boot snapshot:

All the ranges here match our expectations, and are further specified in memory_map.md:

  • 0x0000000000000000 - 0x0000001000000000: Early-boot physical RAM identity-map
  • 0xffff800000000000 - 0xffff801000000000: High-memory physical RAM remap
  • 0xffff900000000000 - 0xffff900000045000: Kernel heap
  • 0xffffa00000000000 - 0xffffa00000001000: CPU core-specific storage page
  • 0xffffffff80000000 - 0xffffffff8221a000: Kernel code and data

The virtual address space state once our page fault breakpoint triggers:

That looks… terrifying, not to mention nonsensical. What are all these regions? Why are they in low memory? Why does the size of one region correspond to the base address of the next region? What am I doing with my life?

Things start to feel a lot more manageable once you realize that the nonsensical address space mappings we’ve listed above might be the result of our paging structures themselves being overwritten due to incorrect memory accesses. If you overwrite your own description of how the virtual address space should be laid out, you can bet you’re going to get weird violations on addresses that should obviously be valid.

This really takes the sting out of these scary faults, and now we can get to work tracking down the exact subsystem responsible for the corrupted page tables.

I notice something supremely odd: when creating the structure backing a new process, the kernel heap appears to be returning addresses in very low memory, rather than addresses above 0xffff900000000000 (the base of the kernel heap).


Okay, this is starting to make sense!

  • For ~some reason~, allocating a new task block returns an address in low memory.
  • When we populate this task block, we overwrite this low memory.
  • Paging structures are some of the first frames allocated by the PMM, and therefore reside in low memory.
  • Paging structures get overwritten with contextually bogus data, resulting in mega-invalid virtual address spaces and page access violations at addresses that should be valid.

The big question now is, why is the kernel heap returning addresses in low memory, rather than addresses within the kernel heap?

After poking around within gdb further, it appears that the return value from kmalloc() is a valid pointer within the heap. This pointer is stored in $rax, then… is truncated by the next instruction?! Let’s take a look at the assembly.

Hmm, I’m not familiar with cdqe. A quick search reveals that it sign-extends the lower 4 bytes of a register into the full 8 bytes. Ah ha! A flash of insight illuminates my grasping waddles. Could it be that kmalloc() is implicitly declared at the call site, so the compiler assumes the return type is a 32-bit int, chucking out the top half of the pointer value?
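In miniature, the failure mode looks like this (hypothetical file and names, not axle’s exact code):

    /* task_small.c, in miniature */
    /* #include "kheap.h"    <-- the missing include that declares kmalloc() */

    void task_create(void) {
        /* With no prototype in scope, the compiler falls back to an implicit
         * declaration: `int kmalloc()`. Only the low 32 bits of the returned
         * heap pointer are kept, then sign-extended into %rax -- that's the
         * `cdqe` from the disassembly -- so `block` points at low memory. */
        void* block = (void*)kmalloc(64);

        /* Writing the new task's fields through `block` now scribbles over
         * whatever happens to live at that low address -- in our case, the
         * paging structures. */
        (void)block;
    }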

Indeed, this is exactly what’s going on. A simple #include at the top of task_small.c fixes this right up, and no more page table corruption for us.

I’ll finish this section with the unedited notes I took as I debugged this.

Lesson

Warnings are useful, and an appetite for ignoring them will wind you up with a bellyful of pain. I enabled -Werror=implicit-function-declaration to ensure that implicitly declared functions are always a hard error, as this is evidently very dangerous: any call site invoking an implicitly declared function, expecting to receive a pointer back, is going to have a very bad time. I then cleaned up all the implicitly declared functions in the kernel.

Also, it’s not great that producing an invalid low pointer causes our paging structures to get overwritten, as these structures are essential to maintaining a system with some semblance of coherence, and their corruption makes debugging way more confusing, challenging, and unpredictable. Userspace programs generally don’t map the bottom region of the address space (or map it as inaccessible). This so-called ‘NULL region’ is very useful for debugging: it means that if a bug ever leads you to accessing low memory, you’ll get a quick crash instead of overwriting important state. This is why you get a ‘segmentation fault’ when dereferencing a NULL pointer!

Since we’re working in the context of physical memory, we don’t have the luxury of mapping frames however we like. If we want to exclude the lowest section of the address space, we have to ‘waste’ that RAM: we can’t use it for anything else. Maybe this is a worthwhile tradeoff. I’m imagining marking the lowest few frames as a ‘scribble region’: all zeroes to start, with some kind of background job that watches for if any non-zero data pops up in these frames. If it does, we know that some part of the system has accessed an invalid pointer. Since these stray low-memory writes can only land on physical RAM while the kernel’s low-memory identity map is active, we could free this scribble region as soon as we disable the low-memory identity map, and give the reserved frames back to the physical memory allocator.

Surely we’re done..?

Bug #3: Overwriting paging structures due to bad pointer math in allocator

I think I’m going to throw up.

We’ve got another bug in which the paging structures in low-memory are being overwritten while cloning a virtual address space. Let’s take a look at our allocation pattern again…

Oh, gnarly! At first glance, our PMM is double-allocating frames; see how 0x0000000000001000 is allocated once near the beginning, then again much later on?

On closer inspection, the PMM starts double-allocating as soon as the frame address ticks over to 0x0000000100000000 (i.e. above 4GB). This triggers my spidey-senses that something, somewhere, isn’t handling 64-bit addresses correctly.

Indeed, I blithely translated some old code into Rust:
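The buggy line boiled down to something like this (a reconstruction with my own names, not the exact code):

    // The 32-bit-era mask tags along into the 64-bit world.
    const FRAME_MASK: u64 = 0x0000_0000_ffff_f000;

    fn enqueue_free_frame(free_frames: &mut Vec<u64>, frame_addr: u64) {
        // Bug: as well as stripping the page-offset bits, this clears bits 32..63,
        // so 0x0000_0001_0000_1000 is enqueued as 0x0000_0000_0000_1000.
        free_frames.push(frame_addr & FRAME_MASK);
    }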

I felt pretty silly when I spotted this one. I’m actually masking off every bit above u32::MAX, since the mask I’m applying is 0x00000000fffff000. Therefore, any addresses above 4GB that are enqueued get truncated to their low-memory cousins. When these truncated frame addresses get allocated, I’m telling each allocator user “no really, go ahead and write to 0x8000, it’s yours!”. This frame is already being used by something else in the system, and chaos ensues.

The fix here is really simple! I moved away from the ’enable absolute bits’ approach to a ‘disable relative bits’ approach, which automatically masks to the correct addresses regardless of the target’s word size.
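Something along these lines:

    fn enqueue_free_frame(free_frames: &mut Vec<u64>, frame_addr: u64) {
        // 'Disable relative bits': clear just the low page-offset bits, which is
        // correct no matter how wide the address is.
        free_frames.push(frame_addr & !0xfff);
    }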

Bug #4: Top portion of address space is missing due to bad math in page mapper

This one was weird. Over the course of writing the contiguous chunk pool in the new Rust-based PMM, I introduced a new piece of global state to manage the pool.
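The new global looked something like this (field names and sizes here are illustrative, and I’m assuming a spinlock like the spin crate’s Mutex for interior mutability):

    use spin::Mutex;

    const CONTIGUOUS_POOL_FRAMES: usize = 1024;

    pub struct ContiguousChunkPool {
        // Physical base address of the region reserved at boot
        base: u64,
        // Which frames within the pool are currently handed out
        allocated: [bool; CONTIGUOUS_POOL_FRAMES],
    }

    // Declared `static`, so its zero-initialized storage is described by the
    // kernel binary and carved out by the bootloader at load time.
    static CONTIGUOUS_CHUNK_POOL: Mutex<ContiguousChunkPool> = Mutex::new(ContiguousChunkPool {
        base: 0,
        allocated: [false; CONTIGUOUS_POOL_FRAMES],
    });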

Since this data is declared static, the storage for this structure will be reserved when the kernel is loaded into memory by axle’s UEFI bootloader.

Note
A colloquial way of saying this is that “the structure will be stored in the .bss binary section”. This isn’t strictly accurate. The .bss section instructs the loader to populate the virtual address space with a zero-filled region at the specified address, but there’s no storage space reserved in the binary itself. Since the data will only exist at runtime anyway, this is a nice trick to ‘inflate’ the storage space that a binary will use upon loading, without paying that storage cost in the binary itself.

As it happens, the build system ended up placing this structure at the very top of the address space reserved for axle’s kernel.

But when this code runs, we get a page fault when trying to access this structure.

Huh, that’s odd. The bootloader sets up memory regions based on whatever is described by the kernel binary:
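Very roughly (helper names are hypothetical; we’ll look at the shape of map_region shortly):

    #include <elf.h>
    #include <stdint.h>

    #define PAGE_SIZE 0x1000ULL
    #define ROUND_DOWN(x, a) ((x) & ~((a) - 1))
    #define ROUND_UP(x, a)   ROUND_DOWN((x) + (a) - 1, (a))

    void map_region(void* kernel_pml4, uint64_t vaddr, uint64_t size);

    // Walk the kernel ELF's program headers and map each loadable segment at the
    // virtual address it requests, rounded out to page boundaries.
    void map_kernel_segments(void* kernel_pml4, Elf64_Ehdr* elf, Elf64_Phdr* phdrs) {
        for (int i = 0; i < elf->e_phnum; i++) {
            Elf64_Phdr* phdr = &phdrs[i];
            if (phdr->p_type != PT_LOAD) {
                continue;
            }
            // p_memsz (not p_filesz) also covers zero-filled regions like .bss
            uint64_t start = ROUND_DOWN(phdr->p_vaddr, PAGE_SIZE);
            uint64_t end = ROUND_UP(phdr->p_vaddr + phdr->p_memsz, PAGE_SIZE);
            map_region(kernel_pml4, start, end - start);
        }
    }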

When we run this, the segment containing our PMM’s new data structure appears to be mapped correctly.

And yet, we get a page fault upon access! What do we see when we interrogate the actual state of the virtual address space, disregarding our intention for what the address space should look like?

Yeah, that’s it! We’re intending for data to be mapped into the virtual address space at 0xffffffff82217fff, but, for some reason, our virtual address space stops at 0xffffffff82200000. When we try to access a data structure that happens to be stored in this bad region, we get a page fault.

But why are we missing the top region of the address space? A likely suspect for investigation is the logic which sets up a virtual address space mapping based on input parameters, and indeed this is where our problems lie.

The logic within the bootloader that maps regions of the virtual address space looks roughly like this:
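The sketch below isn’t the bootloader’s literal code; the helpers and types are stand-ins, but the control flow captures the bug:

    #include <stdint.h>

    #define PAGE_SIZE         0x1000ULL
    #define ENTRIES_PER_TABLE 512ULL

    typedef struct page_table { uint64_t entries[ENTRIES_PER_TABLE]; } page_table_t;

    page_table_t* get_or_create_page_table(void* pml4, uint64_t vaddr);
    uint64_t vaddr_to_pt_index(uint64_t vaddr);
    uint64_t alloc_frame(void);

    void map_region(void* pml4, uint64_t vaddr, uint64_t size) {
        uint64_t pages_remaining = size / PAGE_SIZE;
        // Decide up-front how many page tables this request needs, assuming every
        // table we touch starts out completely empty...
        uint64_t tables_needed = (pages_remaining + ENTRIES_PER_TABLE - 1) / ENTRIES_PER_TABLE;

        for (uint64_t t = 0; t < tables_needed; t++) {
            page_table_t* table = get_or_create_page_table(pml4, vaddr);
            // ...but if this table is already partially full, fewer than 512 fresh
            // pages fit here. We still burn one of our `tables_needed` iterations,
            // so the tail of the request silently never gets mapped.
            for (uint64_t idx = vaddr_to_pt_index(vaddr);
                 idx < ENTRIES_PER_TABLE && pages_remaining > 0;
                 idx++, pages_remaining--, vaddr += PAGE_SIZE) {
                table->entries[idx] = alloc_frame() | 0x3;   // present | writable
            }
        }
    }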

The logic above essentially has two jobs, which are being interleaved:

  1. Map the requested virtual address space region
  2. Identify which paging structures to use to map each frame

The loops above work just fine when all the paging tables we’re accessing along the way are empty. However, if one or more of the tables we’re adding data to are already partially full, the loops won’t populate everything we requested: they calculate how many tables it’d take to satisfy the request if all of those tables were empty to begin with. If some of those tables already hold virtual address mappings, we won’t map as much memory as was requested, and we’ll silently produce an unexpected virtual address space state!

The fix here is conceptually straightforward: rather than trying to compute exactly how many structures we’ll need to iterate upfront, let’s just allocate pages one-by-one until we’ve satisfied the map request, and look up the correct paging structures to map each page on the fly:
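With the same hypothetical helpers as in the sketch above:

    void map_region(void* pml4, uint64_t vaddr, uint64_t size) {
        // No up-front table math: walk page-by-page and look up the right table
        // for each page as we go, so partially full tables are handled naturally.
        for (uint64_t pages_remaining = size / PAGE_SIZE;
             pages_remaining > 0;
             pages_remaining--, vaddr += PAGE_SIZE) {
            page_table_t* table = get_or_create_page_table(pml4, vaddr);
            table->entries[vaddr_to_pt_index(vaddr)] = alloc_frame() | 0x3;   // present | writable
        }
    }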

Bug #5: Transitive heap allocations cause crashes in the PMM

Over the course of debugging the Rust-based PMM, I naturally printed out some state at various interesting locations:
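The kind of thing I added, give or take (the names are stand-ins for the PMM’s internals):

    use std::collections::VecDeque;

    fn alloc_frame(free_frames: &mut VecDeque<u64>) -> u64 {
        let frame = free_frames.pop_front().expect("out of physical frames");
        // Looks harmless, but this line is about to ruin our afternoon...
        println!("PMM handing out frame {frame:#018x}");
        frame
    }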

Upon execution, though, I would get surprising failures about an exhausted virtual address space.

Taking a peek into the virtual address space mappings like we learned above, we can see telltale signs that things have definitely gone awry.

As it turns out, the issue here is that the println!() macro transitively kicks off a heap allocation.

Performing a heap allocation within the physical memory manager is definitely not allowed, as this inverts the dependency tree. The physical memory manager sits at the bottom of the stack, and the virtual memory manager sits above that. The kernel heap is just a client of the virtual memory manager, which requests and frees chunks of virtual address space. In practice, trying to perform heap allocations from the physical memory allocator causes corrupted virtual address space state, and, as a result, confusing crashes. The fix for this particular case is to invoke axle’s C-based printf function directly via FFI, rather than going through the formatting machinery. Of course, this means I need to do a bit of work by hand, such as remembering to null-terminate Rust-originating strings that I pass.
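Something like this (the printf signature is assumed here, and the fixed-size stack buffer keeps everything allocation-free):

    use core::ffi::c_char;

    extern "C" {
        fn printf(fmt: *const c_char, ...) -> i32;
    }

    fn pmm_log(msg: &str) {
        // printf expects a NUL-terminated C string, so build one by hand in a
        // stack buffer -- no heap involved.
        let mut buf = [0u8; 128];
        let len = msg.len().min(buf.len() - 1);
        buf[..len].copy_from_slice(&msg.as_bytes()[..len]);
        unsafe {
            printf(b"%s\n\0".as_ptr() as *const c_char, buf.as_ptr());
        }
    }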

I do wish Rust provided some way to express and enforce that a particular module isn’t allowed to allocate!

