Syscalls, what gives?

26 Aug, 2022 Reading time: 8 minutes

Syscalls are a fundamental piece of the process model within contemporary operating systems.

Operating systems generally like to provide the abstraction that a given program is running on the CPU, linearly and largely in an uninterrupted fashion, from start to finish.

Of course, this isn’t the case; programs are interrupted all the time, for a variety of reasons. Some examples of times the kernel needs to step in:

The kernel needs to handle an event from a peripheral device
The program has page-faulted and needs the kernel’s VMM to make everyone play nice
The program has been preempted to give other programs the chance to use the CPU for a few milliseconds

The underlying reality of what the CPU spends its time on is a complex juggling act of contexts scheduling in and out. And yet, each program retains the illusion that it has access to a pure CPU resource.

This handy illusion comes at a complexity cost: the kernel has to do a delicate dance to maintain this. As an example, the kernel needs to save the contents of all CPU registers each time it preempts a process, so it can restore them when it points the CPU back to the task again.
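To make the register-saving dance a little more concrete, here's a minimal sketch in C. Everything in it is hypothetical: the `cpu_context_t` and `task_t` types, the `cpu` variable standing in for the physical register file, and `context_switch()` are made up for illustration. A real kernel does this step in assembly and saves far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of a task's registers. A real kernel saves
 * many more (segment registers, FPU/SIMD state, and so on). */
typedef struct cpu_context {
    uint64_t rax, rbx, rcx, rdx;
    uint64_t rsp, rbp, rip;
} cpu_context_t;

typedef struct task {
    int id;
    cpu_context_t saved; /* registers as of the last preemption */
} task_t;

/* Stand-in for the physical CPU's register file. */
cpu_context_t cpu;

/* Preempt `current` and resume `next`. */
void context_switch(task_t *current, task_t *next) {
    assert(current != NULL && next != NULL);
    current->saved = cpu; /* save the outgoing task's registers */
    cpu = next->saved;    /* restore the incoming task's registers */
}
```

Switching away from a task and later switching back leaves `cpu` exactly as that task last saw it, which is the whole trick behind the illusion.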

All that said, this illusion isn’t complete: in many key places, operating systems allow themselves to expose the fact that programs are running in a managed environment. There are certain tasks that a program cannot complete with just a CPU resource dutifully chugging along. For example, file I/O and creating new processes clearly need guidance by the benevolent hand of the supervisor. For this, operating systems provide the concept of system calls. Taking these two cases as an example, we can point out the read() and fork() syscalls.

While most function calls within a program are ‘local’ procedure calls – the code being branched to is located within the program’s address space^✱ – system calls are a form of remote procedure call. When a system call is invoked, code outside the program itself receives control, fulfills the request, and returns control to the calling program^✱.

✱ Note

This isn’t the dichotomy I’m making it out to be: most operating systems structure their virtual memory layout such that the kernel, and all its memory, is mapped into every process, with some paging bits set such that this memory isn’t allowed to be accessed by user-mode code. This makes things pretty convenient for the kernel in a number of ways. However, the de-facto standard is rapidly changing in response to attacks such as Meltdown.

✱ Note

Well, sometimes. Some syscalls, such as exit(), will cause the process to be terminated entirely, in which case the kernel certainly won’t be doing any resuming.

Syscalls, how do they work?

Okay, we’ve got two kinds of calls:

‘Local’ procedure calls: This is our familiar function call, say to printf(), in which control is transferred to some other part of the program.
‘Remote’ procedure calls: These are our syscalls, in which control is transferred to the kernel.

All types of calls typically need to do a few things:

Pass arguments to the callee
Receive a return value from the callee
Remember where we came from, so we know where to return to when the callee completes
Ensure the callee doesn’t overwrite CPU registers or stack memory that’s in use by the caller

Calling convention

‘Local’ procedure calls accomplish this by way of an agreement. If everyone does the same song-and-dance when calling another function, and when selecting which registers to overwrite during their own execution, then we get a nice system in which we can call other pieces of code in our program and return from them, and everything lives in harmony without overwriting each other’s data.

This agreement is an important part of an ABI, or application binary interface. More specifically, we’re focusing now on the ABI’s calling convention.

We could imagine a calling convention like the following. We want to talk about syscalls, though, so we’ll do the bare minimum to keep the nerds at bay:

void caller(void);
void callee(int arg1, int arg2);

caller:
    # Set up a stack frame for ourselves
    push rbp
    mov rbp, rsp
    # ...
    # Push the arguments to `callee` onto the stack
    push 123
    push 456
    # Call the function
    call callee
    # Return value is stored in rax
    #  ...
    # Tear down our stack frame
    mov rsp, rbp
    pop rbp
    ret 

callee:
    # Set up a stack frame for ourselves
    push rbp
    mov rbp, rsp
    # Our arguments are stored at stack offsets above the saved rbp
    # and the return address
    # Do some computation, storing the result in rbx
    mov rbx, [rbp + 16]
    add rbx, [rbp + 24]
    # Put the return value in rax
    mov rax, rbx
    # Tear down our stack frame
    mov rsp, rbp
    pop rbp
    ret

Here, we pass arguments to callees on the stack, return values in the rax register, and each function preserves the stack frame of its caller.

Calling convention 2: Electric Boogaloo

Our syscalls will need their own calling convention.

Firstly, we’ll need some way to invoke the syscall at all. User-mode code runs at a lower privilege level than the kernel, which means that user-mode code can’t directly invoke kernel-mode code via something as familiar as the call instruction.

Most operating systems use a trick here.

CPUs normally execute streams of instructions. But, every now and again, the CPU needs to jump somewhere totally unrelated, not due to a branch instruction within the code itself. This comes in a couple different flavors:

CPU interrupts
CPU exceptions

An interrupt is an event indicated by a high bit in the CPU’s interrupt line. Before the CPU executes each and every instruction, it’ll check whether any bits in its interrupt line are set. If they are, the CPU will divert its attention from the very important pixels you were rendering onto the device that needs wrangling. CPU interrupts are typically the results of other hardware on the bus doing something, such as a hardware timer firing, or a disk completing a fetch. When an interrupt is raised, the CPU will jump to the interrupt handler.

An exception is thrown by circuitry internal to the CPU when the code the CPU was previously running has encountered some condition that needs further handling. For example, if the code the CPU is running tries to divide by zero, or execute an invalid opcode, the CPU will generate an exception and jump to its handler.

But how does the CPU know where to jump to? For each of these examples, there exists some code in the OS kernel that’ll kill the process that divided by zero, or that’ll kick off the disk driver to do some work. How does the CPU know where to go?

The IDT

Of course, the CPU doesn’t inherently know, and telling the CPU where to go in various circumstances is one of the responsibilities of the kernel. The exact data structure (on x86_64) is obscure and annoying, but the high-level idea is that the kernel creates a data structure in memory that expresses something like the following:

{
    // ...
    PAGE_FAULT: _handle_page_fault,
    DIVIDE_BY_ZERO: _handle_divide_by_zero,
    INVALID_OPCODE: _handle_invalid_opcode,
    INTERRUPT0: _handle_interrupt0,
    INTERRUPT1: _handle_interrupt1,
    INTERRUPT2: _handle_interrupt2,
    // ...
}

The kernel then loads the address^✱ of this structure into a special CPU register, so it knows where to look.

✱ Note

The linear (virtual) address, if you’re counting.

When the CPU experiences an interrupt or exception, it’ll consult this table to find where to jump to. This is how the kernel is able to respond to CPU-level events.

This data structure is called the interrupt descriptor table, or IDT. Why ‘interrupt’ if exceptions are included too? x86 is bad at naming things.
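Here's a rough sketch of the table idea in C, simulated entirely in user space. The names `idt`, `idt_init()`, `cpu_raise()`, and the handlers are hypothetical, and the table is simplified to bare function pointers, whereas a real x86_64 IDT entry is a 16-byte descriptor. The vector numbers 0, 6, and 14 are the ones x86 assigns to divide error, invalid opcode, and page fault.

```c
#include <assert.h>
#include <stddef.h>

#define IDT_ENTRIES 256

typedef void (*handler_fn)(void);

/* Simplified table: one handler pointer per vector. */
static handler_fn idt[IDT_ENTRIES];

/* Records the last vector handled, so the effect is observable. */
int last_handled = -1;

static void handle_divide_by_zero(void) { last_handled = 0; }
static void handle_invalid_opcode(void) { last_handled = 6; }
static void handle_page_fault(void)     { last_handled = 14; }

/* The kernel fills in the table during boot. */
void idt_init(void) {
    idt[0]  = handle_divide_by_zero;
    idt[6]  = handle_invalid_opcode;
    idt[14] = handle_page_fault;
}

/* What the CPU does, conceptually, when `vector` fires: look up the
 * entry and jump to it. Returns 0 if a handler ran, -1 if the slot
 * was empty. */
int cpu_raise(int vector) {
    assert(vector >= 0 && vector < IDT_ENTRIES);
    if (idt[vector] == NULL) {
        return -1;
    }
    idt[vector]();
    return 0;
}
```

After `idt_init()`, calling `cpu_raise(14)` lands in the page fault handler, while a vector with no entry is reported as unhandled.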

Eyes on the prize: we want to learn about syscalls.

While the exceptions that we’ve been looking at so far have been implicitly generated by software (particularly when said software does something a bit naughty), x86 also provides a way for software to explicitly generate an exception. This is done via the int instruction, which takes a single operand: the index into the IDT of the exception type that should be triggered.

But… does that mean that any user program is able to generate a CPU-level event that makes it look as though, say, a disk event has just occurred? Not quite, as another field within each IDT entry instructs the CPU as to whether code running in unprivileged protection levels is allowed to trigger it^✱.

✱ Note

An attempt to do this would trigger a general protection fault. The code would therefore trigger an interrupt, just not perhaps the one it wanted =)
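The privilege check can be sketched in a few lines of C. This is a simulation under some assumptions: `gate_t` and `software_int()` are hypothetical names, and each table entry is boiled down to just its privilege field. On x86 that field is the descriptor privilege level (DPL), and the fault raised on a failed check is a general protection fault, which is vector 13.

```c
#include <assert.h>

/* One simplified IDT entry: only the privilege field is modeled. */
typedef struct gate {
    int dpl; /* highest ring number allowed to target this gate with `int` */
} gate_t;

enum { VECTOR_GENERAL_PROTECTION = 13 };

/* Returns the vector that is actually delivered when code running at
 * privilege level `cpl` (0 = kernel, 3 = user) executes `int vector`.
 * This check applies to software-generated interrupts only; hardware
 * interrupts and CPU exceptions are not subject to it. */
int software_int(const gate_t *table, int table_len, int vector, int cpl) {
    assert(table != 0 && vector >= 0 && vector < table_len);
    if (cpl > table[vector].dpl) {
        return VECTOR_GENERAL_PROTECTION;
    }
    return vector;
}
```

With every gate at DPL 0 except 0x80 at DPL 3, user code asking for `int 0x80` gets vector 0x80, while user code asking for `int 14` gets vector 13 instead.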

Ok, syscalls?

So, operating systems typically select a ‘syscall vector’, or specific index into the IDT, that’s specially designated for programs to initiate RPCs into the kernel. This index into the IDT is the operand that’ll be used with the int instruction; to demonstrate visually, if our syscall vector is 128 (a common, and mostly arbitrary, choice among operating systems), a syscall will be invoked via the following assembly instruction:

int 0x80

Of course, we’re missing some important information here! int 0x80 only tells us that we’re going to be invoking a syscall, but it doesn’t say anything about which syscall we want. For that, we need to describe the syscall calling convention.

Similarly to how we need to come up with some methodology for passing arguments when invoking ‘local’ function calls, we need to do the same for syscalls. Traditionally, and in axle, a second vector, this one identifying the desired syscall, will be placed into eax. In other words, the program must place a value in eax that specifies what syscall it’d like to invoke, before invoking the syscall interrupt via the int 0x80 instruction.

For example, write() might be assigned syscall vector #12, and so, to invoke write(), a program might contain assembly like the following:

mov eax, 0xc
int 0x80

Note that eax operates as a sort of opcode, selecting from a table of function pointers in a similar way to how the argument to int selects from a table of interrupt types.
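On the kernel side, that table of function pointers might look something like the following sketch. None of this is axle's actual code: `syscall_regs_t`, `handle_int_0x80()`, the toy `sys_write()`, passing arguments in ebx and ecx, and returning the result in eax are all assumptions made for illustration. The only detail taken from above is that write() sits at slot 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register snapshot handed to the handler for vector 0x80. */
typedef struct syscall_regs {
    uint32_t eax; /* in: which syscall to run; out: its return value */
    uint32_t ebx; /* first argument (assumed convention) */
    uint32_t ecx; /* second argument (assumed convention) */
} syscall_regs_t;

typedef uint32_t (*syscall_fn)(uint32_t arg0, uint32_t arg1);

#define SYSCALL_COUNT   16
#define SYSCALL_INVALID 0xFFFFFFFFu

/* Toy stand-in for write(): pretends every byte was written. */
static uint32_t sys_write(uint32_t fd, uint32_t len) {
    (void)fd;
    return len;
}

/* Indexed by the value in eax; unlisted slots are NULL. */
static syscall_fn syscall_table[SYSCALL_COUNT] = {
    [12] = sys_write,
};

/* Runs when user code executes `int 0x80`. */
void handle_int_0x80(syscall_regs_t *regs) {
    assert(regs != NULL);
    if (regs->eax >= SYSCALL_COUNT || syscall_table[regs->eax] == NULL) {
        regs->eax = SYSCALL_INVALID; /* no such syscall */
        return;
    }
    regs->eax = syscall_table[regs->eax](regs->ebx, regs->ecx);
}
```

A caller that sets eax to 0xc, ebx to a file descriptor, and ecx to a length gets the length back in eax, and an unassigned number comes back as `SYSCALL_INVALID`.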

In axle, IPC is achieved through the amc_message_send syscall, which uses the syscall vector #1. The signature is something like the following:

void amc_message_send(const char* dest_service, void* payload, usize payload_len);

Of course, the receiver of the message is going to want to interpret the payload it’s received, and so most payloads normally contain a discriminator as the first field:

typedef struct window_moved_event {
    uint32_t message_type;
    usize window_id;
    Point new_window_pos;
} window_moved_event_t;

Note that we’ve now got three levels of discriminators happening!

The vector passed to the int instruction, selecting which interrupt to invoke
The vector passed in eax, selecting which syscall to invoke
The discriminator passed in the first u32 of the payload, describing what kind of message is being sent
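The third level can be sketched from the receiver's point of view. This is a guess at the general shape rather than axle's real code: the `MSG_*` values, `point_t`, `window_closed_event_t`, and `describe_message()` are hypothetical, and `size_t` stands in for `usize`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message type values. */
enum { MSG_WINDOW_MOVED = 1, MSG_WINDOW_CLOSED = 2 };

typedef struct point { int32_t x, y; } point_t;

typedef struct window_moved_event {
    uint32_t message_type;
    size_t window_id;
    point_t new_window_pos;
} window_moved_event_t;

typedef struct window_closed_event {
    uint32_t message_type;
    size_t window_id;
} window_closed_event_t;

/* The scheme only works if the discriminator really is the first field. */
static_assert(offsetof(window_moved_event_t, message_type) == 0,
              "discriminator must come first");
static_assert(offsetof(window_closed_event_t, message_type) == 0,
              "discriminator must come first");

/* Reads the leading u32 of a payload and reports which kind of
 * message it claims to be, rejecting payloads that are too short. */
const char *describe_message(const void *payload, size_t payload_len) {
    uint32_t type;
    if (payload == NULL || payload_len < sizeof type) {
        return "malformed";
    }
    memcpy(&type, payload, sizeof type);
    switch (type) {
    case MSG_WINDOW_MOVED:
        return payload_len >= sizeof(window_moved_event_t)
            ? "window moved" : "malformed";
    case MSG_WINDOW_CLOSED:
        return payload_len >= sizeof(window_closed_event_t)
            ? "window closed" : "malformed";
    default:
        return "unknown";
    }
}
```

The receiver never needs to know the payload's type up front: it peeks at the first four bytes, then decides how to interpret the rest.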

Computers are a layer cake.

Phillip Tennen
