{
	"id": "f616864e-0df2-47ed-9a0e-d89b1813c11c",
	"created_at": "2026-04-06T00:19:58.801173Z",
	"updated_at": "2026-04-10T03:20:39.285196Z",
	"deleted_at": null,
	"sha1_hash": "e4c6433e2494d4c2f24e4f4595ed4aaf9ae675b0",
	"title": "Anatomy of a system call, part 2",
	"llm_title": "",
	"authors": "",
	"file_creation_date": "0001-01-01T00:00:00Z",
	"file_modification_date": "0001-01-01T00:00:00Z",
	"file_size": 197198,
	"plain_text": "Anatomy of a system call, part 2\r\nBy July 16, 2014 This article was contributed by David Drysdale\r\nArchived: 2026-04-05 16:14:17 UTC\r\nLWN.net needs you!\r\nWithout subscribers, LWN would simply not exist. Please consider signing up for a subscription and\r\nhelping to keep LWN publishing.\r\nThe previous article explored the kernel implementation of system calls (syscalls) in its most vanilla form: a\r\nnormal syscall, on the most common architecture: x86_64. We complete our look at syscalls with variations on\r\nthat basic theme, covering other x86 architectures and other syscall mechanisms. We start by exploring the various\r\n32-bit x86 architecture variants, for which a map of the territory involved may be helpful. The map is clickable on\r\nthe filenames and arrow labels to link to the code referenced:\r\nx86_32 syscall invocation via SYSENTER\r\nThe normal invocation of a system call on a 32-bit x86_32 system is closely analogous to the mechanism for\r\nx86_64 systems that was described in the previous article. Here, the arch/x86/syscalls/syscall_32.tbl table\r\nhas an entry for sys_read:\r\n 3 i386 read sys_read\r\nhttps://lwn.net/Articles/604515/\r\nPage 1 of 8\n\nThis entry indicates that read() for x86_32 has syscall number 3, with entry point sys_read() and an i386\r\ncalling convention. The table post-processor will emit a __SYSCALL_I386(3, sys_read, sys_read) macro call\r\ninto the generated arch/x86/include/generated/asm/syscalls_32.h file. This, in turn, is used to build the\r\nsyscall table, sys_call_table, as before.\r\nMoving outward, sys_call_table is accessed from the ia32_sysenter_target entry point of\r\narch/x86/kernel/entry_32.S. 
However, here the SAVE_ALL macro actually pushes a different set of registers\r\n(EBX/ECX/EDX/ESI/EDI/EBP rather than RDI/RSI/RDX/R10/R8/R9) onto the stack, reflecting the different\r\nsyscall ABI convention for this platform.\r\nThe location of the ia32_sysenter_target entry point gets written to a model-specific register (MSR) at kernel\r\nstart (in enable_sep_cpu()); in this case, the MSR in question is MSR_IA32_SYSENTER_EIP (0x176), which is\r\nused for handling the SYSENTER instruction.\r\nThis shows the invocation path from userspace. The standard modern ABI for how x86_32 programs invoke a\r\nsystem call is to put the system call number (3 for read()) into the EAX register, and the other parameters into\r\nspecific registers (EBX, ECX, and EDX for the first 3 parameters), then invoke the SYSENTER instruction.\r\nThis instruction causes the processor to transition to ring 0 and invoke the code referenced by the\r\nMSR_IA32_SYSENTER_EIP model-specific register — namely ia32_sysenter_target. That code pushes the\r\nregisters onto the (kernel) stack, and calls the function pointer at entry EAX in sys_call_table — namely\r\nsys_read(), which is a thin, asmlinkage wrapper for the real implementation in SYSC_read().\r\nx86_32 syscall invocation via INT 0x80\r\nThe sys_call_table table is also accessed in arch/x86/kernel/entry_32.S from the system_call assembly\r\nentry point. Again, this entry point saves registers to the stack, then uses the EAX register to pick the relevant\r\nentry in sys_call_table and call it. 
This time, the location of the system_call entry point is used by\r\ntrap_init():\r\n #ifdef CONFIG_X86_32\r\n set_system_trap_gate(SYSCALL_VECTOR, \u0026system_call);\r\n set_bit(SYSCALL_VECTOR, used_vectors);\r\n #endif\r\nThis sets up the handler for the SYSCALL_VECTOR trap to be system_call; that is, it sets it up to be the recipient of\r\nthe venerable INT 0x80 software interrupt method for invoking system calls.\r\nThis is the original user-space invocation path for system calls, which is now generally avoided because, on\r\nmodern processors, it's slower than the instructions that are specifically designed for system call invocation\r\n(SYSCALL and SYSENTER).\r\nWith this older ABI, programs invoke a system call by putting the system call number into the EAX register, and\r\nthe other parameters into specific registers (EBX, ECX, and EDX for the first 3 parameters), then invoking the\r\nINT 0x80 instruction. This instruction causes the processor to transition to ring 0 and invoke the trap handler for\r\nsoftware interrupt 0x80 — namely system_call. The system_call code pushes the registers onto the (kernel)\r\nstack, and calls the function pointer at entry EAX in the sys_call_table table — namely sys_read(), the thin,\r\nasmlinkage wrapper for the real implementation in SYSC_read(). Much of that should seem familiar, as it is the\r\nsame as for using SYSENTER.\r\nx86 syscall invocation mechanisms\r\nFor reference, the different user-space syscall invocation mechanisms on x86 we've seen so far are as follows:\r\n64-bit programs use the SYSCALL instruction. 
This instruction was originally introduced by AMD, but was\r\nsubsequently implemented on Intel 64-bit processors and so is the best choice for cross-platform\r\ncompatibility.\r\nModern 32-bit programs use the SYSENTER instruction, which Intel added to the IA-32 architecture with the\r\nPentium II processor.\r\nAncient 32-bit programs use the INT 0x80 instruction to trigger a software interrupt handler, but this is\r\nmuch slower than SYSENTER on modern processors.\r\nx86_32 syscall invocation on x86_64\r\nNow for a more complicated case: what happens if we are running a 32-bit binary on our x86_64 system? From\r\nthe user-space perspective, nothing is different; in fact, nothing can be different, because the user code being run is\r\nexactly the same.\r\nFor the SYSENTER case, an x86_64 kernel registers a different function as the handler in the\r\nMSR_IA32_SYSENTER_EIP model-specific register. This function has the same name (ia32_sysenter_target) as\r\nthe x86_32 code but a different definition (in arch/x86/ia32/ia32entry.S). In particular, it pushes the old-style\r\nregisters but uses a different syscall table, ia32_sys_call_table. This table is built from the 32-bit table of\r\nentries; in particular, it will have entry 3 (as used on 32-bit systems), rather than 0 (which is the syscall number for\r\nread() on 64-bit systems), mapping to sys_read().\r\nFor the INT 0x80 case, the trap_init() code on x86_64 instead invokes:\r\n #ifdef CONFIG_IA32_EMULATION\r\n set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);\r\n set_bit(IA32_SYSCALL_VECTOR, used_vectors);\r\n #endif\r\nThis maps the IA32_SYSCALL_VECTOR (which is still 0x80) to ia32_syscall. This assembly entry point (in\r\narch/x86/ia32/ia32entry.S) uses ia32_sys_call_table rather than the 64-bit sys_call_table.\r\nA more complex example: execve and 32-bit compatibility handling\r\nNow let's look at a system call that involves other complications: execve(). 
We'll again work outward from the\r\nkernel implementation of the system call, and explore the differences from the simpler read() call along the way.\r\nAgain, a clickable map of the territory we're going to explore might help things along:\r\nThe execve() definition in fs/exec.c looks similar to that for read(), but there is an interesting additional\r\nfunction defined right after it (at least when CONFIG_COMPAT is defined):\r\n SYSCALL_DEFINE3(execve,\r\n const char __user *, filename,\r\n const char __user *const __user *, argv,\r\n const char __user *const __user *, envp)\r\n {\r\n return do_execve(getname(filename), argv, envp);\r\n }\r\n #ifdef CONFIG_COMPAT\r\n asmlinkage long compat_sys_execve(const char __user * filename,\r\n const compat_uptr_t __user * argv,\r\n const compat_uptr_t __user * envp)\r\n {\r\n return compat_do_execve(getname(filename), argv, envp);\r\n }\r\n #endif\r\nFollowing the processing path, these two implementations converge on do_execve_common() to perform the real\r\nwork (sys_execve() → do_execve() → do_execve_common() versus compat_sys_execve() →\r\ncompat_do_execve() → do_execve_common()), setting up user_arg_ptr structures along the way. These\r\nstructures hold those syscall arguments that are pointers-to-pointers, together with an indication of whether they\r\ncome from a 32-bit compatibility ABI; if so, the value being pointed to is a 32-bit user-space address, not a 64-bit\r\nvalue, and the code to copy the argument values from user space needs to allow for that.\r\nSo, unlike read(), where the syscall implementation didn't need to distinguish between 32-bit and 64-bit callers\r\nbecause the arguments were pointers-to-values, execve() does need to distinguish, because it has arguments that\r\nare pointers-to-pointers. 
This turns out to be a common theme — other compat_sys_name() entry points are there\r\nto cope with pointer-to-pointer arguments (or pointer-to-struct-containing-pointer arguments, for example\r\nstruct iovec or struct aiocb).\r\nx32 ABI support\r\nThe complication of having two variant implementations of execve() spreads outward from the code to the\r\nsystem call tables. For x86_64, the 64-bit table has two distinct entries for execve():\r\n 59 64 execve stub_execve\r\n ...\r\n 520 x32 execve stub_x32_execve\r\nThe additional entry in the 64-bit table at syscall number 520 is for x32 ABI programs, which run on x86_64\r\nprocessors but use 32-bit pointers. As a result of the 64 and x32 ABI indicators, we will end up with stub_execve\r\nas entry 59 in sys_call_table, and stub_x32_execve as entry 520.\r\nAlthough this is our first mention of the x32 ABI, it turns out that our previous read() example did quietly\r\ninclude x32 ABI compatibility. As no pointer-to-pointer address translation was needed, the syscall invocation\r\npath (and syscall number) could simply be shared with the 64-bit version.\r\nBoth stub_execve and stub_x32_execve are defined in arch/x86/kernel/entry_64.S. These entry points call\r\non sys_execve() and compat_sys_execve(), but also save additional registers (R12-R15, RBX, and RBP) to\r\nthe kernel stack. Similar stub_* wrappers are also present in arch/x86/kernel/entry_64.S for other syscalls\r\n(rt_sigreturn(), clone(), fork(), and vfork()) that may potentially need to restart user-space execution at a\r\ndifferent address and/or with a different user stack than when the syscall was invoked.\r\nFor x86_32, the 32-bit table has an entry for execve() that's slightly different in format from that for read():\r\n 11 i386 execve sys_execve stub32_execve\r\nFirst of all, this tells us that execve() has syscall number 11 on 32-bit systems, as compared to number 59 (or\r\n520) on 64-bit systems. 
More interesting to observe is the presence of an extra field in the 32-bit table, holding a\r\ncompatibility entry point stub32_execve. For a native 32-bit build of the kernel, this extra field is ignored and the\r\nsys_call_table holds sys_execve() as entry 11, as usual.\r\nHowever, for a 64-bit build of the kernel, the IA-32 compatibility code inserts the stub32_execve() entry point\r\ninto ia32_sys_call_table as entry 11. This entry point is defined in arch/x86/ia32/ia32entry.S as:\r\n PTREGSCALL stub32_execve, compat_sys_execve\r\nThe PTREGSCALL macro sets up the stub32_execve entry point to call on to compat_sys_execve()\r\n(by putting its address into RAX), and saves additional registers (R12-R15, RBX, and RBP) to the kernel stack\r\n(like stub_execve() above).\r\ngettimeofday(): vDSO\r\nSome system calls just read a small amount of information from the kernel, and for these, the full machinery of a\r\nring transition is a lot of overhead. The vDSO (Virtual Dynamically-linked Shared Object) mechanism speeds up\r\nsome of these read-only syscalls by mapping the page containing the relevant information (and code to read it)\r\ninto user space, read-only. In particular, the page is set up in the format of an ELF shared-library, so it can be\r\nstraightforwardly linked into user programs.\r\nRunning ldd on a normal glibc-using binary shows the vDSO as a dependency on linux-vdso.so.1 or linux-gate.so.1 (which ldd obviously can't find a file to back); it also shows up in the memory map of a running\r\nprocess ([vdso] in cat /proc/PID/maps).\r\nHistorically, vsyscall was an earlier mechanism to do something similar, which is now deprecated due to security\r\nconcerns. 
This older article by Johan Petersson describes how vsyscall's page appears as an ELF object (at a fixed\r\nposition) to user space.\r\nThere's a Linux Journal article that discusses vDSO setup in some detail (although it is now slightly out of date),\r\nso we'll just describe the basics here, as applied to the gettimeofday() syscall.\r\nFirst, gettimeofday() needs to access data. To allow this, the relevant vsyscall_gtod_data structure is\r\nexported into a special data section called .vvar_vsyscall_gtod_data. Linker instructions then ensure that this\r\n.vvar_vsyscall_gtod_data section is linked into the kernel in the __vvar_page section, and at kernel startup\r\nthe setup_arch() function calls map_vsyscall() to set up a fixed mapping for that __vvar_page.\r\nThe code that provides the core vDSO implementation of gettimeofday() is in __vdso_gettimeofday(). It's\r\nmarked as notrace to prevent the compiler from ever adding function profiling, and also gets a weak alias as\r\ngettimeofday(). To ensure that the resulting page looks like an ELF shared object, the vdso.lds.S file pulls in\r\nvdso-layout.lds.S and exports both gettimeofday() and __vdso_gettimeofday() into the page.\r\nTo make the vDSO page accessible to a new user-space program, the code in setup_additional_pages() sets\r\nthe vDSO page location to a random address chosen by vdso_addr() at process start time. Using a random\r\naddress mitigates the security problems found with the earlier vsyscall implementation, but does mean that the\r\nuser program needs a way to find the location of the vDSO page. The location is exposed to user space as an ELF\r\nauxiliary value: the binary loader for ELF format programs (load_elf_binary()) uses the ARCH_DLINFO macro\r\nto set the AT_SYSINFO_EHDR auxiliary value. 
The user-space program can then find the page using the getauxval()\r\nfunction to retrieve the relevant auxiliary value (although in practice the libc library usually takes care of this\r\nunder the covers).\r\nFor completeness, we should also mention that the vDSO mechanism is used for another important syscall-related\r\nfeature for 32-bit programs. At boot time, the kernel determines which of the possible x86_32 syscall invocation\r\nmechanisms is best, and puts the appropriate implementation wrapper (SYSENTER, INT 0x80, or even SYSCALL for\r\nan AMD 64-bit processor) into the __kernel_vsyscall function. User-space programs can then invoke this\r\nwrapper and be sure of getting the fastest way into the kernel for their syscalls; see Petersson's article for more\r\ndetails.\r\nptrace(): syscall tracing\r\nThe ptrace() system call is implemented in the normal manner, but it's particularly relevant here because it can\r\ncause system calls of the traced kernel task to behave differently. Specifically, the PTRACE_SYSCALL request aims\r\nto “arrange for the tracee to be stopped at the next entry to or exit from a system call”.\r\nRequesting PTRACE_SYSCALL causes the TIF_SYSCALL_TRACE thread information flag to be set in thread-specific\r\ndata (struct thread_info.flags). The effect of this is architecture-specific; we'll describe the x86_64\r\nimplementation.\r\nLooking more closely at the assembly for syscall entry (in all of x86_32, x86_64, and ia32) we see a detail that we\r\nskipped over previously: if the thread flags have any of the _TIF_WORK_SYSCALL_ENTRY flags (which include\r\nTIF_SYSCALL_TRACE) set, the syscall implementation code follows a different path to invoke\r\nsyscall_trace_enter() instead (x86_32, x86_64, ia32). 
The syscall_trace_enter() function then performs\r\na variety of different functions that are associated with the various per-thread flag values that were checked for\r\nwith _TIF_WORK_SYSCALL_ENTRY:\r\nTIF_SINGLESTEP: single stepping of instructions for ptrace\r\nTIF_SECCOMP: perform secure computing checks on syscall entry\r\nTIF_SYSCALL_EMU: perform syscall emulation\r\nTIF_SYSCALL_TRACE: syscall tracing for ptrace\r\nTIF_SYSCALL_TRACEPOINT: syscall tracing for ftrace\r\nTIF_SYSCALL_AUDIT: generation of syscall audit records\r\nIn other words, syscall_trace_enter is the control point for a whole collection of different per-syscall interception\r\nfunctionality — including TIF_SYSCALL_TRACE syscall tracing. It ends up calling ptrace_stop() with\r\nwhy=CLD_TRAPPED, which notifies the tracing program (via SIGCHLD) that the tracee has been stopped on entry to a\r\nsyscall.\r\nEpilogue\r\nSystem calls have been the standard method for user-space programs to interact with Unix kernels for decades\r\nand, consequently, the Linux kernel includes a set of facilities to make it easy to define them and to efficiently use\r\nthem. Despite the invocation variations across architectures and occasional special cases, system calls also remain\r\na remarkably homogeneous mechanism — this stability and homogeneity allows all sorts of useful tools, from\r\nstrace to seccomp-bpf, to work in a generic way.\r\nSource: https://lwn.net/Articles/604515/",
	"extraction_quality": 1,
	"language": "EN",
	"sources": [
		"MITRE"
	],
	"references": [
		"https://lwn.net/Articles/604515/"
	],
	"report_names": [
		"604515"
	],
	"threat_actors": [],
	"ts_created_at": 1775434798,
	"ts_updated_at": 1775791239,
	"ts_creation_date": 0,
	"ts_modification_date": 0,
	"files": {
		"pdf": "https://archive.orkl.eu/e4c6433e2494d4c2f24e4f4595ed4aaf9ae675b0.pdf",
		"text": "https://archive.orkl.eu/e4c6433e2494d4c2f24e4f4595ed4aaf9ae675b0.txt",
		"img": "https://archive.orkl.eu/e4c6433e2494d4c2f24e4f4595ed4aaf9ae675b0.jpg"
	}
}