diff --git a/src/references.bib b/src/references.bib index 24609b0..e0d59fe 100644 --- a/src/references.bib +++ b/src/references.bib @@ -725,7 +725,7 @@ @manual{ARM-MPIDR-EL1 @manual{Intel-proc-ids, title = {Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide}, author = {Intel Corporation}, - note = {See chapter 10.4.5 - Identyfing Logical Processors in an MP System}, + note = {See chapter 10.4.5 - Identifying Logical Processors in an MP System}, url = {https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html} } diff --git a/src/thesis-en.tex b/src/thesis-en.tex index 0d83925..9d5904b 100644 --- a/src/thesis-en.tex +++ b/src/thesis-en.tex @@ -45,7 +45,7 @@ \renewcommand{\chaptermark}[1]{\markboth{\thechapter.\ #1}{}} -\renewcommand{\headrulewidth}{0 pt} +\renewcommand{\headrulewidth}{0pt} \fancypagestyle{plain}{ @@ -224,7 +224,7 @@ -% -------- TABLES AD FIGURES NUMBERING -------- +% -------- TABLES AND FIGURES NUMBERING -------- \renewcommand*{\thetable}{\arabic{chapter}.\arabic{table}} \renewcommand*{\thefigure}{\arabic{chapter}.\arabic{figure}} @@ -255,10 +255,6 @@ % ------- END OF PREAMBLE PART (MOSTLY) ---------- - - - - % ---------- USER SETTINGS --------- \newcommand{\tytul}{Funkcjonalne jądro systemu operacyjnego: AlkOS} @@ -276,21 +272,6 @@ \null\thispagestyle{empty}\newpage -% ------ PAGE WITH SIGNATURES ------------ - -%\thispagestyle{empty}\newpage -%\null -% -%\vfill -% -%\begin{center} -%\begin{tabular}[t]{ccc} -%............................................. & \hspace*{100pt} & .............................................\\ -%supervisor's signature & \hspace*{100pt} & author's signature -%\end{tabular} -%\end{center} -% - % ---------- ABSTRACT ----------- @@ -326,31 +307,13 @@ \end{abstract} } -%% --------- DECLARATIONS ------------ -% -%% -%% IT IS NECESSARY OT ATTACH FILLED-OUT AUTORSHIP DEECLRATION. 
SCAN (IN PDF FORMAT) NEEDS TO BE PLACED IN scans FOLDER AND IT SHOULD BE CALLED, FOR EXAMPLE, DECLARATION_OF_AUTORSHIP.PDF. IF THE FILENAME OR FILEPATH IS DIFFERENT, THE FILEPATH IN THE NEXT COMMAND HAS TO BE ADJUSTED ACCORDINGLY. -%% -%% command attacging the declarations of autorship -%% -%\includepdf[pages=-]{scans/declaration-of-autorship} -%\null\thispagestyle{empty}\newpage -% -%% optional declaration -%% -%% command attaching the declaataration on granting a license -%% -%\includepdf[pages=-]{scans/declaration-on-granting-a-license} -%% -%% .tex corresponding to the above PDF files are present in the 3. declarations folder -% \null\thispagestyle{empty}\newpage % ------- TABLE OF CONTENTS ------- \selectlanguage{english} \pagenumbering{gobble} \tableofcontents \thispagestyle{empty} -\newpage % IF YOU HAVE EVEN QUANTITY OD PAGES OF TOC, THEN REMOVE IT OR ADD \null\newpage FOR DOUBLE BLANK PAGE BEFORE INTRODUCTION +\newpage % IF THE TOC HAS AN EVEN NUMBER OF PAGES, THEN REMOVE THIS OR ADD \null\newpage FOR A DOUBLE BLANK PAGE BEFORE THE INTRODUCTION % -------- THE BODY OF THE THESIS ------------ @@ -361,8 +324,6 @@ \setcounter{page}{11} \chapter{Introduction} \label{chap:introduction} -% \markboth{}{Introduction} -% \addcontentsline{toc}{chapter}{Introduction} \section{Background and Context} \subsection{Motivation} \subsection{Scale} -Given the magnitude of the task, operating system development is predominantly an enterprise-level initiative involving large teams of seasoned professionals. Independent or hobbyist operating systems, developed by individuals or small groups, are exceptionally rare. Notably, even in an era dominated by Windows, macOS, and Linux, new operating systems continue to emerge --- as evidenced by reports of Google developing an Android-based operating system with AI at its core \cite{google_aluminium_os}.
+Given the magnitude of the task, operating system development is predominantly an enterprise-level initiative involving large teams of seasoned professionals. Independent or hobbyist operating systems, developed by individuals or small groups, are exceptionally rare. Notably, even in an era dominated by Windows, macOS, and Linux, new operating systems continue to emerge --- as evidenced by reports of Google developing an Android-based operating system with AI at its core~\cite{google_aluminium_os}. -To put this scarcity into perspective, consider the software landscape. The number of PC games is estimated in the tens of thousands, with over 121,000 available on Steam alone as of early 2026. Mobile applications number in the millions, with approximately 2.18 million on the Google Play Store \cite{stats_google_play}. The web is even more vast, with the total number of websites estimated at 1.34 billion \cite{stats_websites}. All of these software artifacts rely on the operating system as their host. +To put this scarcity into perspective, consider the software landscape. The number of PC games is estimated in the tens of thousands, with over 121,000 available on Steam alone as of early 2026. Mobile applications number in the millions, with approximately 2.18 million on the Google Play Store~\cite{stats_google_play}. The web is even more vast, with the total number of websites estimated at 1.34 billion~\cite{stats_websites}. All of these software artifacts rely on the operating system as their host. -In contrast, the number of operating systems is orders of magnitude smaller. While exact figures are difficult to determine due to the historical prevalence of proprietary mainframe systems, current estimates are telling. The number of active open-source operating systems --- the majority being forks of Linux and BSD --- is tracked at around 900 \cite{stats_distrowatch}. 
Wikipedia lists approximately 300 notable proprietary and historical operating systems \cite{wiki_list_os}, and there are roughly 100 to 150 market-leading real-time operating systems (RTOS) \cite{wiki_rtos}. As for hobbyist kernels created from scratch? The OSDev wiki lists only about 175 active projects \cite{osdev_projects}. +In contrast, the number of operating systems is orders of magnitude smaller. While exact figures are difficult to determine due to the historical prevalence of proprietary mainframe systems, current estimates are telling. The number of active open-source operating systems --- the majority being forks of Linux and BSD --- is tracked at around 900~\cite{stats_distrowatch}. Wikipedia lists approximately 300 notable proprietary and historical operating systems~\cite{wiki_list_os}, and there are roughly 100 to 150 market-leading real-time operating systems (RTOS)~\cite{wiki_rtos}. As for hobbyist kernels created from scratch? The OSDev wiki lists only about 175 active projects~\cite{osdev_projects}. The creation of a functional kernel from scratch by a team of three students is, therefore, a statistically rare undertaking. \subsection{Difficulties} -The primary challenge in independent operating system development is the scarcity of high-quality educational materials. This is a direct consequence of the scale of the field; the community is simply too small to sustain a paved learning path. The OSDev Wiki \cite{osdev_main} stands as the central resource for hobbyist developers, yet without it, the task would be nearly impossible. Creating an operating system requires knowledge not just of how the hardware is constructed (as detailed in manuals), but of how to effectively drive it --- a distinction akin to knowing how a car is built versus knowing how to drive it at a professional level. +The primary challenge in independent operating system development is the scarcity of high-quality educational materials.
This is a direct consequence of the scale of the field; the community is simply too small to sustain a paved learning path. The OSDev Wiki~\cite{osdev_main} stands as the central resource for hobbyist developers, yet without it, the task would be nearly impossible. Creating an operating system requires knowledge not just of how the hardware is constructed (as detailed in manuals), but of how to effectively drive it --- a distinction akin to knowing how a car is built versus knowing how to drive it at a professional level. While the OSDev community is invaluable, its resources often suffer from a lack of depth or coherence. Many articles convey high-level ideas but lack specific implementation details, while others are outdated or incorrect. Tutorials frequently stop shortly after the operating system equivalent of the "Hello World" stage --- establishing a minimal bootable kernel that prints to the screen --- leaving the developer to bridge the massive gap to a fully functional system alone. @@ -404,7 +365,7 @@ \subsection{Difficulties} Beyond the educational barrier lie the technical hurdles: implementing device drivers, creating memory management algorithms, and designing task schedulers. However, for the uninitiated, the lack of a structured theoretical framework remains the most formidable obstacle. -This thesis aims to address both aspects. It documents the technical creation of the AlkOS kernel --- our design choices, the problems we encountered, and the solutions we engineered (detailed in Chapters \ref{chap:low_level_implementation} and \ref{chap:high_level_subsystems}, with corresponding source code attached to this thesis). Simultaneously, it serves as a guide to the problem space itself (specifically in Chapter \ref{chap:os_from_scratch}), intended to assist future developers in their attempt to create an operating system from scratch. +This thesis aims to address both aspects. 
It documents the technical creation of the AlkOS kernel --- our design choices, the problems we encountered, and the solutions we engineered (detailed in Chapters~\ref{chap:low_level_implementation} and~\ref{chap:high_level_subsystems}, with corresponding source code attached to this thesis). Simultaneously, it serves as a guide to the problem space itself (specifically in Chapter~\ref{chap:os_from_scratch}), intended to assist future developers in their attempt to create an operating system from scratch. \section{Scope of the Thesis} \label{subsec:scope} @@ -525,7 +486,7 @@ \section{Work Division} \item \textbf{Virtual File System (VFS):} Designed the core VFS abstraction layer, mount point management using efficient prefix matching, and a three-tier file descriptor management system supporting file and pipe resources. \item \textbf{FAT File System Driver:} Implemented FAT driver that unifies FAT12, FAT16, and FAT32 logic, abstracting I/O operations to support both block devices and in-memory ramdisks. \item \textbf{System Call Interface:} Designed the user space transition mechanism, utilizing a compile-time type-deducing dispatch table to automatically map CPU registers to C++ function arguments, ensuring ABI compliance. - \item \textbf{Hardware \& ACPI:} Integrated the \texttt{uACPI} \cite{uacpi_repo} library with the kernel, implemented the OS abstraction layer (OSL), designed the ACPI table retrieval mechanism, and implemented shutdown and reboot procedures via ACPI. + \item \textbf{Hardware \& ACPI:} Integrated the \texttt{uACPI}~\cite{uacpi_repo} library with the kernel, implemented the OS abstraction layer (OSL), designed the ACPI table retrieval mechanism, and implemented shutdown and reboot procedures via ACPI. 
\end{itemize} \item \textbf{Data Structures and Standard Library:} @@ -613,7 +574,7 @@ \subsection{Cross-Compilation Toolchain} The development process begins with the establishment of a cross-compilation toolchain targeting a generic, OS-independent architecture (e.g., \texttt{x86-64-elf}). Using a compiler provided by the host system may lead to issues, as such compilers typically assume that the generated code will execute within an operating system environment. These assumptions are invalid during development, where no such infrastructure exists initially. -For this reason, the cross-compiler is built from source, allowing it to be configured and, if necessary, patched to accommodate the specific requirements of kernel and user space development. A complete cross-compilation toolchain typically consists of a cross-compiler (e.g., GCC or Clang), a suite of binary utilities (including the assembler and linker), and optionally a cross-debugger such as GDB or LLDB. Although any suitable cross-compiler may be used to build an operating system, GCC is generally preferred due to extensive support within the operating systems development community. For a detailed guide to building a GCC-based cross-compilation toolchain, refer to \cite{osdev-gcc-cross-compiler}. +For this reason, the cross-compiler is built from source, allowing it to be configured and, if necessary, patched to accommodate the specific requirements of kernel and user space development. A complete cross-compilation toolchain typically consists of a cross-compiler (e.g., GCC or Clang), a suite of binary utilities (including the assembler and linker), and optionally a cross-debugger such as GDB or LLDB. Although any suitable cross-compiler may be used to build an operating system, GCC is generally preferred due to extensive support within the operating systems development community. For a detailed guide to building a GCC-based cross-compilation toolchain, refer to~\cite{osdev-gcc-cross-compiler}. 
At later stages of development, particularly when building user space applications, an extended version of the toolchain is required. This toolchain must be capable of automatically linking against our C standard library and including system headers. The process of adapting the toolchain to recognize the new operating system as a valid target is discussed in Subsection~\ref{subsubsec:os-specific-toolchain}. @@ -634,7 +595,7 @@ \subsubsection{Selection of Build Tools} \subsubsection{Toolchain Integration} -As required by the freestanding development model described in Section \ref{subsec:theory_toolchain}, the build system must be configured to use the previously established cross-compilation toolchain. This is typically achieved through the use of \textbf{toolchain configuration files}, which act as an adapter layer. +As required by the freestanding development model described in Section~\ref{subsec:theory_toolchain}, the build system must be configured to use the previously established cross-compilation toolchain. This is typically achieved through the use of \textbf{toolchain configuration files}, which act as an adapter layer. A toolchain configuration file must fulfill two primary functions: \begin{enumerate} @@ -669,7 +630,7 @@ \subsubsection{Compilation Profiles} \subsection{Emulation} -It is essential to establish an efficient testing strategy. In this context, hardware emulation becomes a critical tool. Two primary approaches exist for testing kernel software. The first involves utilizing a dedicated physical test machine, where the system image is flashed and booted for each iteration. While verifying kernel behavior on real hardware is strictly necessary to guarantee correctness, this process is time-consuming and often becomes a bottleneck during rapid development cycles. 
To address this efficiency issue, various emulation solutions are available, including QEMU \cite{qemu_website}, Bochs (x86 only) \cite{bochs_website}, and VirtualBox \cite{virtualbox_website}. These tools allow for the specification of the target machine architecture, enabling the kernel to run locally within the host system in an emulated environment. This setup significantly accelerates the workflow by facilitating quick execution, debugging, and state exploration. Consequently, the ideal solution combines both methods to maximize development speed while ensuring software quality. +It is essential to establish an efficient testing strategy. In this context, hardware emulation becomes a critical tool. Two primary approaches exist for testing kernel software. The first involves utilizing a dedicated physical test machine, where the system image is flashed and booted for each iteration. While verifying kernel behavior on real hardware is strictly necessary to guarantee correctness, this process is time-consuming and often becomes a bottleneck during rapid development cycles. To address this efficiency issue, various emulation solutions are available, including QEMU~\cite{qemu_website}, Bochs (x86 only)~\cite{bochs_website}, and VirtualBox~\cite{virtualbox_website}. These tools allow for the specification of the target machine architecture, enabling the kernel to run locally within the host system in an emulated environment. This setup significantly accelerates the workflow by facilitating quick execution, debugging, and state exploration. Consequently, the ideal solution combines both methods to maximize development speed while ensuring software quality. \section{Implementation of the Standard Library} @@ -691,18 +652,18 @@ \subsubsection{Development Methodologies} \paragraph{Implementation from Scratch} -Alternatively, the standard library may be developed specifically for the operating system. 
This approach offers full control over memory usage, performance characteristics, and integration with kernel-specific features. However, it introduces significant complexity. Comparative studies of existing C standard libraries demonstrate substantial variation in performance, memory footprint, and standards compliance, even among mature implementations \cite{libc_comparison}. Achieving conformance with the language standard requires careful handling of numerous corner cases, including floating-point behavior, complex string formatting semantics, and locale support. Even minor deviations from the specified behavior can lead to subtle incompatibilities when porting third-party software. +Alternatively, the standard library may be developed specifically for the operating system. This approach offers full control over memory usage, performance characteristics, and integration with kernel-specific features. However, it introduces significant complexity. Comparative studies of existing C standard libraries demonstrate substantial variation in performance, memory footprint, and standards compliance, even among mature implementations~\cite{libc_comparison}. Achieving conformance with the language standard requires careful handling of numerous corner cases, including floating-point behavior, complex string formatting semantics, and locale support. Even minor deviations from the specified behavior can lead to subtle incompatibilities when porting third-party software. \subsubsection{User Space vs. Kernel Space Variants} -Standard library functionality is required in both user space and kernel space. However, these environments impose fundamentally different constraints. User space code executes without privileges and relies on system call interface, whereas kernel code executes in a privileged context, invokes kernel services directly rather than through system calls, and must comply with constraints imposed by the kernel execution environment, such as interrupt safety. 
As a result, a single, uniform library implementation is insufficient. +Standard library functionality is required in both user space and kernel space. However, these environments impose fundamentally different constraints. User space code executes without privileges and relies on the system call interface, whereas kernel code executes in a privileged context, invokes kernel services directly rather than through system calls, and must comply with constraints imposed by the kernel execution environment, such as interrupt safety. As a result, a single, uniform library implementation is insufficient. To address this, the library is split into a user space variant (\texttt{libc}) and a kernel variant (\texttt{libk}), each compiled with different configuration flags and assumptions. This separation enforces correct usage and prevents kernel code from accidentally relying on functionality that is unavailable or unsafe in kernel context. \subsubsection{Program Initialization and the C Runtime (CRT)} \label{subsubsec:theory_crt} -In addition to exposing user-facing APIs, the runtime environment must be initialized before control is transferred to the program's entry function. User space programs do not begin execution at \texttt{main}. Instead, execution starts at a library-provided entry point, conventionally named \texttt{\_start}, which is supplied by a startup object file \cite{osdev_crt}. +In addition to exposing user-facing APIs, the runtime environment must be initialized before control is transferred to the program's entry function. User space programs do not begin execution at \texttt{main}. Instead, execution starts at a library-provided entry point, conventionally named \texttt{\_start}, which is supplied by a startup object file~\cite{osdev_crt}. 
This initialization sequence is realized through coordinated interaction between the linker, compiler, and standard library, and typically involves a set of well-defined object files: @@ -788,7 +749,7 @@ \subsubsection{Program Initialization and the C Runtime (CRT)} \section{Bootloader} \label{subsec:theory_bootloader} -The process of bringing a computer from a powered-off state to a fully functional operating system is governed by a rigid chain of physical and logical constraints. At the hardware level, the Central Processing Unit (CPU) functions as a complex state machine. Upon the application of power or a reset signal, the CPU resets its internal registers to default values and sets the Instruction Pointer to a specific, hardcoded physical address known as the \textit{Reset Vector} \cite{IntelManual-Reset}. +The process of bringing a computer from a powered-off state to a fully functional operating system is governed by a rigid chain of physical and logical constraints. At the hardware level, the Central Processing Unit (CPU) functions as a complex state machine. Upon the application of power or a reset signal, the CPU resets its internal registers to default values and sets the Instruction Pointer to a specific, hardcoded physical address known as the \textit{Reset Vector}~\cite{IntelManual-Reset}. \subsection{The Memory Paradox and Storage} A fundamental challenge in this sequence is the source of the initial instructions. The standard Random Access Memory (RAM), which serves as the primary workspace for modern operating systems, is volatile. It requires active electrical flow to maintain its state. When the system is powered off, the state is lost. Upon power-up, the memory cells contain random garbage data. Consequently, the CPU cannot fetch valid instructions from standard RAM immediately after a reset. 
@@ -796,7 +757,7 @@ \subsection{The Memory Paradox and Storage} To resolve this, hardware architects map the Reset Vector address to a non-volatile memory region, typically Flash Memory or Read-Only Memory (ROM), which retains data without power. \subsection{Embedded vs. Complex Architectures} -In simple embedded architectures (e.g., microcontrollers used in household appliances like washing machines or microwaves), the entire application code is often stored in this non-volatile memory. The memory controller maps this storage directly into the CPU's addressable space. This technique, known as \textbf{Execute In Place (XIP)}, allows the CPU to fetch and execute the developer's code from the very first clock cycle \cite{ARM-CortexM4-Generic-User-Guide}. The developer "owns" the machine from the first nanosecond. +In simple embedded architectures (e.g., microcontrollers used in household appliances like washing machines or microwaves), the entire application code is often stored in this non-volatile memory. The memory controller maps this storage directly into the CPU's addressable space. This technique, known as \textbf{Execute In Place (XIP)}, allows the CPU to fetch and execute the developer's code from the very first clock cycle~\cite{ARM-CortexM4-Generic-User-Guide}. The developer ``owns'' the machine from the first nanosecond. In contrast, more complex architectures (such as ARM-based smartphones or single-board computers like the Raspberry Pi) often store the main operating system on external, complex storage media like SD cards or eMMC chips. The CPU cannot simply memory-map an SD card. It requires a sophisticated software driver to communicate with the storage controller. To bridge this gap, manufacturers embed a tiny, immutable piece of software called the \textbf{BootROM} directly into the silicon.
This code initializes the minimal required hardware (often internal SRAM) and loads a secondary bootloader from the external storage into that SRAM, which in turn loads the main software. @@ -897,11 +858,11 @@ \subsection{Embedded vs. Complex Architectures} \item \textbf{DRAM Training:} Modern DDR4/DDR5 memory requires a complex calibration process to align signal timing before it becomes usable. \end{enumerate} -To manage this, modern chipsets often include a smaller, dedicated processor (e.g., the Intel Management Engine or AMD Platform Security Processor) that starts before the main CPU. This co-processor initializes the platform hardware to a state where the main CPU can begin execution \cite{Intel-Datasheet-Vol1}. +To manage this, modern chipsets often include a smaller, dedicated processor (e.g., the Intel Management Engine or AMD Platform Security Processor) that starts before the main CPU. This co-processor initializes the platform hardware to a state where the main CPU can begin execution~\cite{Intel-Datasheet-Vol1}. \subsection{The Chain of Trust and Abstraction} \label{subsubsec:chain_of_trust_and_abstraction} -By the time a kernel begins execution, it is likely the fourth or fifth program in the boot chain. The entity responsible for defining the interface between the hardware and the OS is the \textbf{System Firmware} \cite{UEFI-Base-Spec, UEFI-PI-Spec}. +By the time a kernel begins execution, it is likely the fourth or fifth program in the boot chain. The entity responsible for defining the interface between the hardware and the OS is the \textbf{System Firmware}~\cite{UEFI-Base-Spec, UEFI-PI-Spec}. The firmware's responsibility is to abstract the diverse implementations of different motherboards (e.g., how the disk controller is wired) and provide a mechanism to load an OS from a disk into RAM. 
However, relying solely on firmware is often insufficient for a portable operating system: \begin{itemize} @@ -910,7 +871,7 @@ \item \textbf{Interface Variance:} The method used to retrieve a memory map or video configuration can vary wildly between hardware generations. \end{itemize} -To solve this, a \textbf{Third-Party Bootloader} is often utilized. This program acts as a "Normalizer" \cite{Limine-Spec}. It knows how to talk to various firmware types and storage devices. Its job is to abstract away the firmware differences, load the kernel file into memory, and pass control to the OS in a unified, predictable manner. +To solve this, a \textbf{Third-Party Bootloader} is often utilized. This program acts as a ``Normalizer''~\cite{Limine-Spec}. It knows how to talk to various firmware types and storage devices. Its job is to abstract away the firmware differences, load the kernel file into memory, and pass control to the OS in a unified, predictable manner. \begin{figure}[htbp] \centering @@ -948,7 +909,7 @@ \subsection{The OS-Level Trampoline} \section{Memory Preloading and Discovery} \label{sec:mem_discovery} -One of the first and most critical responsibilities of a kernel during the bootstrap phase is to establish an authoritative map of the system's physical memory. Unlike user-space applications, which simply request memory from the operating system via system calls (e.g., \texttt{malloc} or \texttt{mmap}), the kernel is the manager responsible for fulfilling those requests. Upon entry, the kernel does not know how much RAM is available, where it is located, or which memory ranges are reserved for hardware-mapped I/O (MMIO). +One of the first and most critical responsibilities of a kernel during the bootstrap phase is to establish an authoritative map of the system's physical memory.
Unlike user space applications, which simply request memory from the operating system via system calls (e.g., \texttt{malloc} or \texttt{mmap}), the kernel is the manager responsible for fulfilling those requests. Upon entry, the kernel does not know how much RAM is available, where it is located, or which memory ranges are reserved for hardware-mapped I/O (MMIO). This discovery process is not standardized. It is strictly coupled to the target architecture, the silicon vendor, and the residing firmware. Depending on the platform complexity, the kernel may acquire the memory map through one of three primary mechanisms: static definition, firmware interrogation, or hardware description structures. @@ -956,7 +917,7 @@ \subsection{Static Definition} On strictly embedded architectures (e.g., ARM Cortex-M or AVR), the physical memory layout is immutable. The location and size of SRAM banks, Flash storage, and peripheral registers are defined by the silicon vendor and do not change. In these environments, runtime discovery is redundant. -The memory map is hardcoded directly into the kernel's source code or linker scripts, matching the specific System-on-Chip (SoC) datasheet \cite{ARM-CortexM4-Generic-User-Guide}. The developer explicitly defines the boundary between kernel code, stack, and heap. As illustrated in Figure \ref{fig:cortex_m4_memory}, the address space is rigid. The kernel assumes ownership of specific addresses immediately upon reset without querying external entities. +The memory map is hardcoded directly into the kernel's source code or linker scripts, matching the specific System-on-Chip (SoC) datasheet~\cite{ARM-CortexM4-Generic-User-Guide}. The developer explicitly defines the boundary between kernel code, stack, and heap. As illustrated in Figure~\ref{fig:cortex_m4_memory}, the address space is rigid. The kernel assumes ownership of specific addresses immediately upon reset without querying external entities. 
In this context, the Operating System does not "discover" memory. The kernel code assumes these addresses are valid from the first instruction. For example, a Cortex-M4 kernel may be hardcoded to expect code at 0x00000000 and RAM at 0x20000000. If the software is flashed onto a different chip variant, it will simply fault. Flexibility is sacrificed for minimizing initialization overhead. @@ -1079,7 +1040,7 @@ \subsection{Static Definition} \draw (periph_region.south east) -- (periph_target); \end{tikzpicture} -\caption{Cortex-M4 Memory Map with Bit-banding regions (Adapted from \cite{ARM-CortexM4-Generic-User-Guide})} +\caption{Cortex-M4 Memory Map with Bit-banding regions (Adapted from~\cite{ARM-CortexM4-Generic-User-Guide})} \label{fig:cortex_m4_memory} \end{figure} @@ -1094,14 +1055,14 @@ \subsection{Flattened Device Tree (DTB)} \item \textbf{Strings Block}: A pool of null-terminated property names referenced by offset. \end{enumerate} -The kernel parses this blob at boot to discover available RAM. \cite{Devicetree-Spec} +The kernel parses this blob at boot to discover available RAM~\cite{Devicetree-Spec}. \subsection{Firmware Interrogation} On general-purpose platforms (x86-64), the hardware is modular. The kernel cannot predict the amount of installed RAM or the physical address map. In this scenario, the kernel must query the system firmware directly. This introduces a dependency on the firmware interface: \begin{itemize} - \item \textbf{Legacy BIOS:} Requires invoking interrupt vectors (e.g., \texttt{INT 0x15, EAX=0xE820}) to retrieve a list of memory ranges \cite{osdev-int15}. - \item \textbf{UEFI:} Requires calling specific boot services (\texttt{GetMemoryMap}) to retrieve descriptors of physical pages and their attributes \cite{UEFI-Base-Spec}. + \item \textbf{Legacy BIOS:} Requires invoking interrupt vectors (e.g., \texttt{INT 0x15, EAX=0xE820}) to retrieve a list of memory ranges~\cite{osdev-int15}. 
+ \item \textbf{UEFI:} Requires calling specific boot services (\texttt{GetMemoryMap}) to retrieve descriptors of physical pages and their attributes~\cite{UEFI-Base-Spec}. \end{itemize} \subsection{Hardware Abstraction} @@ -1121,8 +1082,8 @@ \subsubsection{Feature Identification} While x86 relies on this dynamic instruction-based discovery, other architectures employ different strategies: \begin{itemize} - \item \textbf{ARM64 (AArch64)} utilizes special system registers (e.g., \texttt{ID\_AA64PFR0\_EL1}) that the kernel reads to determine support for floating-point units or cryptographic extensions \cite{ARM-Arch-Ref,}. - \item \textbf{RISC-V} typically employs the Device Tree Blob (DTB) or the \texttt{misa} (Machine ISA) Control and Status Register to inform the kernel about supported standard extensions (e.g., Atomics, Floats) \cite{RISCV-Priv-Spec}. + \item \textbf{ARM64 (AArch64)} utilizes special system registers (e.g., \texttt{ID\_AA64PFR0\_EL1}) that the kernel reads to determine support for floating-point units or cryptographic extensions~\cite{ARM-Arch-Ref}. + \item \textbf{RISC-V} typically employs the Device Tree Blob (DTB) or the \texttt{misa} (Machine ISA) Control and Status Register to inform the kernel about supported standard extensions (e.g., Atomics, Floats)~\cite{RISCV-Priv-Spec}. \end{itemize} \subsubsection{Feature Enablement} @@ -1142,7 +1103,7 @@ \subsubsection{Feature Enablement} \section{Establishment of Basic Communication} \label{sec:comms} -One of the primary objectives when initializing code on a target architecture is to establish an external communication channel. In an emulation environment, this is often achieved by interacting with the emulator's framework (e.g., QEMU utilizes a serial port \cite{wikibooks-serial} that can be attached to a Linux shell session). 
On physical hardware, the developer may need to render fonts on a screen (e.g., using the VGA standard on x86-64 desktop platforms \cite{osdev-barebones, osdev-vga}) or implement a basic network stack. It should be noted that at this stage, a rudimentary implementation is often sufficient. However, this serves as a provisional solution, to ensure correct handling in the future, a fully-fledged infrastructure and a proper hardware abstraction layer must be established. The preferred method should support bidirectional communication during the early development stages to facilitate testing and provide input to the kernel. This functionality is primarily required for debugging and testing, and as development progresses, it is advisable to abandon this simple communication or disable it via compilation flags. +One of the primary objectives when initializing code on a target architecture is to establish an external communication channel. In an emulation environment, this is often achieved by interacting with the emulator's framework (e.g., QEMU utilizes a serial port~\cite{wikibooks-serial} that can be attached to a Linux shell session). On physical hardware, the developer may need to render fonts on a screen (e.g., using the VGA standard on x86-64 desktop platforms~\cite{osdev-barebones, osdev-vga}) or implement a basic network stack. It should be noted that at this stage, a rudimentary implementation is often sufficient. However, this serves as a provisional solution, to ensure correct handling in the future, a fully-fledged infrastructure and a proper hardware abstraction layer must be established. The preferred method should support bidirectional communication during the early development stages to facilitate testing and provide input to the kernel. This functionality is primarily required for debugging and testing, and as development progresses, it is advisable to abandon this simple communication or disable it via compilation flags. 
% \begin{figure}[h] % \centering @@ -1151,12 +1112,12 @@ \section{Establishment of Basic Communication} % \label{fig:qemu_comms} % \end{figure} -% As shown in Figure \ref{fig:qemu_comms}, alongside typical output to the screen and keyboard input, the kernel transmits text to the QEMU serial port, which is then streamed to the host shell. This approach enables host-side scripts to parse logs, detect bugs or failures, and allow manual inspection of the system state immediately preceding a crash. +% As shown in Figure~\ref{fig:qemu_comms}, alongside typical output to the screen and keyboard input, the kernel transmits text to the QEMU serial port, which is then streamed to the host shell. This approach enables host-side scripts to parse logs, detect bugs or failures, and allow manual inspection of the system state immediately preceding a crash. \section{Enabling Interrupts and Exceptions} \label{subsec:os-tutorial-interrupts} -Prior to the implementation of memory management (Section \ref{subsubsec:physical_memory_management}), it is essential to establish a basic interrupt handling mechanism. On most platforms, the exception system relies entirely on interrupt mappings. Consequently, without a functional interrupts, the kernel is unable to display debug information on the designated communication device when an error occurs. +Prior to the implementation of memory management (Section~\ref{subsubsec:physical_memory_management}), it is essential to establish a basic interrupt handling mechanism. On most platforms, the exception system relies entirely on interrupt mappings. Consequently, without functional interrupts, the kernel is unable to display debug information on the designated communication device when an error occurs. Exception handlers should output a descriptive message detailing the failure. 
This message must include the CPU state, the source of the exception, the instruction pointer where the error occurred, and, if applicable, an error code explaining the cause. An example of an x86-64 kernel dump following an exception is presented below:
@@ -1204,13 +1165,13 @@ \subsection{Design Considerations}
\item \textbf{Software Interrupts} --- Initiated by software instructions. For example, on the x86-64 architecture, the instruction \texttt{INT 0x80} triggers an interrupt with the vector number \texttt{0x80}.
\end{itemize}
-Architecture-specific details must also be considered. For example, the x86-64 architecture utilizes the legacy PIC \cite{osdev-pic} and the improved SMP-supporting APIC \cite{osdev-apic}, and it would be wise to support both somehow.
+Architecture-specific details must also be considered. For example, the x86-64 architecture utilizes the legacy PIC~\cite{osdev-pic} and the improved SMP-supporting APIC~\cite{osdev-apic}, and it is advisable to support both.
Finally, to enable Symmetric Multiprocessing (SMP) and utilize multiple cores, the interrupt mechanism serves as the primary method of inter-core communication.
\section{Tracing System}
\label{sec:tracing}
-As discussed in Section \ref{subsec:os-tutorial-interrupts}, direct tracing inside interrupt handlers is generally discouraged due to latency concerns. However, in a general context, kernel logs and debug messages must be preserved to a file or terminal. The challenge is that writing to a file or physical device incurs significant latency, while system code must execute as rapidly as possible. To resolve this, the tracing framework must be robust enough to operate in a concurrent environment (handling interrupts and SMP) while decoupling the generation of traces from their output.
+As discussed in Section~\ref{subsec:os-tutorial-interrupts}, direct tracing inside interrupt handlers is generally discouraged due to latency concerns.
However, in a general context, kernel logs and debug messages must be persisted to a file or terminal. The challenge is that writing to a file or physical device incurs significant latency, while system code must execute as rapidly as possible. To resolve this, the tracing framework must be robust enough to operate in a concurrent environment (handling interrupts and SMP) while decoupling the generation of traces from their output. This can be achieved by allocating a large circular buffer for messages, which is then asynchronously flushed to the output device by a dedicated, low-priority task. Furthermore, to facilitate effective debugging in a multi-threaded and multi-core environment, each log entry may automatically capture essential context metadata, such as a high-precision timestamp, the current Core ID, and the Process ID.
@@ -1221,7 +1182,7 @@ \section{Testing framework}
The reliability of an operating system kernel cannot be guaranteed solely through compilation checks or host-based logic verification. While some algorithmic components can be tested in a hosted environment (e.g., on a Linux host), the core kernel functionalities depend heavily on the specific hardware architecture, physical memory layout, and privileged CPU instructions. Consequently, a robust testing framework must be implemented to execute directly within the kernel on the target hardware or emulator.
-This in-kernel framework should function similarly to user-space unit testing libraries but operates within the kernel. It allows developers to define test cases for critical subsystems, such as memory allocators, intrusive data structures, and scheduling algorithms, and execute them in the actual runtime environment.
This approach is the only way to detect architecture-specific issues, such as unaligned memory accesses, incorrect register usage during context switches, or paging faults caused by invalid table entries, which would remain undetected in a hosted simulation or simply impossible to test.
+This in-kernel framework should function similarly to user space unit testing libraries but operate within the kernel. It allows developers to define test cases for critical subsystems, such as memory allocators, intrusive data structures, and scheduling algorithms, and execute them in the actual runtime environment. This approach is the only way to detect architecture-specific issues, such as unaligned memory accesses, incorrect register usage during context switches, or paging faults caused by invalid table entries, which would remain undetected in a hosted simulation or would be impossible to test at all.
A critical design requirement for such a framework is test isolation. In a kernel environment, a failure often results in a system panic, infinite loop, or silent memory corruption that compromises the global state. To prevent a failed test's side effects from influencing subsequent assertions (creating false positives or negatives), the framework should ideally ensure a clean state for each test suite. In early development stages, this often necessitates implementing mechanisms to reset subsystems between test executions, ensuring that results are deterministic and reproducible.
@@ -1231,7 +1192,7 @@ \section{Physical Memory Management}
The Physical Memory Manager (PMM) constitutes the foundational layer of the operating system's memory subsystem. Its primary responsibility is the accounting of the machine's finite Random Access Memory (RAM).
In a monolithic kernel with virtual memory support, the PMM typically operates at two distinct levels of granularity: \begin{enumerate} -\item \textbf{Page Frame Allocation:} The allocation of raw, contiguous physical memory in fixed-size units called \textit{page frames} (typically 4096 bytes on x86-64). This is primarily used to back Virtual Memory mappings for user-space processes. +\item \textbf{Page Frame Allocation:} The allocation of raw, contiguous physical memory in fixed-size units called \textit{page frames} (typically 4096 bytes on x86-64). This is primarily used to back Virtual Memory mappings for user space processes. \item \textbf{Kernel Heap Allocation:} The sub-allocation of memory within those pages for the kernel's own internal data structures (e.g., thread control blocks, file descriptors). These requests vary wildly in size, from a few bytes to several kilobytes. \end{enumerate} @@ -1239,7 +1200,7 @@ \section{Physical Memory Management} \subsection{The Core Challenge: Fragmentation} -If an allocator simply handed out memory sequentially and never had to accept returns (frees), the implementation would be a trivial pointer increment. Complexity arises because memory is borrowed and returned in an arbitrary order. This leads to \textbf{fragmentation}: the inability to reuse memory that is technically free. Fragmentation manifests in two distinct forms \cite{wilson1995survey}: +If an allocator simply handed out memory sequentially and never had to accept returns (frees), the implementation would be a trivial pointer increment. Complexity arises because memory is borrowed and returned in an arbitrary order. This leads to \textbf{fragmentation}: the inability to reuse memory that is technically free. Fragmentation manifests in two distinct forms~\cite{wilson1995survey}: \begin{itemize} \item \textbf{Internal Fragmentation:} This occurs when the allocator assigns a block larger than what was requested. 
For example, if a request for 20 bytes is rounded up to a 32-byte block to satisfy alignment requirements, 12 bytes are wasted.
@@ -1252,10 +1213,10 @@ \subsection{The Core Challenge: Fragmentation}
\subsection{Anatomy of an Allocator}
-To understand how different allocators address these challenges, it is helpful to use the taxonomy proposed by Wilson et al. \cite{wilson1995survey}, which separates allocator design into three levels of abstraction:
+To understand how different allocators address these challenges, it is helpful to use the taxonomy proposed by Wilson et al.~\cite{wilson1995survey}, which separates allocator design into three levels of abstraction:
\subsubsection{1. Strategy}
-The strategy is the high-level philosophy used to combat fragmentation. It relies on heuristics about program behavior. Research into allocation traces reveals three dominant patterns \cite{wilson1995survey}:
+The strategy is the high-level philosophy used to combat fragmentation. It relies on heuristics about program behavior. Research into allocation traces reveals three dominant patterns~\cite{wilson1995survey}:
\begin{itemize}
\item \textbf{Ramps:} Monotonic accumulation of long-lived data (e.g., building a parse tree).
@@ -1270,8 +1231,7 @@ \subsubsection{2. Policy}
\begin{itemize}
\item \textbf{First Fit:} Scan the free memory from the beginning and return the first block that is large enough. This is fast but can accumulate small ``splinters'' of free memory at the start of the list.
\item \textbf{Best Fit:} Search the entire list to find the smallest block that satisfies the request. This minimizes the wasted remainder (unused space after the split) but requires a comprehensive search, which can be slow.
- \item \textbf{Next Fit}: Resume searching from where the last allocation stopped.
Scatters allocations and often
-performs poorly
+ \item \textbf{Next Fit}: Resume searching from where the last allocation stopped. It scatters allocations across the heap and often performs poorly in practice.
\end{itemize}
\subsubsection{3. Mechanism}
@@ -1362,7 +1322,7 @@ \subsection{Mechanism: Stack-Based Allocation (Free Lists)}
\subsection{Mechanism: The Buddy System}
-To solve the contiguity problem while maintaining decent performance, general-purpose operating systems (like Linux) utilize the \textbf{Buddy System}. The Buddy System is a specific implementation of a broader class of allocators known as \textbf{Segregated Fits}, and its theoretical foundations are thoroughly described by Knuth \cite[Section 2.5]{knuth1973art}.
+To solve the contiguity problem while maintaining decent performance, general-purpose operating systems (like Linux) utilize the \textbf{Buddy System}. The Buddy System is a specific implementation of a broader class of allocators known as \textbf{Segregated Fits}, and its theoretical foundations are thoroughly described by Knuth \cite[Section~2.5]{knuth1973art}.
\subsubsection{Concept: Segregated Fits}
In a segregated fit architecture, memory is not viewed as one monolithic pool. Instead, the allocator maintains an array of free lists. Each list acts as a bin dedicated to blocks of a specific size class. For example, index 0 might hold 4 KiB blocks, index 1 might hold 8 KiB blocks, and so on. This allows the allocator to quickly locate a block that fits a request without searching through blocks that are vastly too large or too small.
@@ -1425,13 +1385,13 @@ \subsubsection{The Binary Buddy Algorithm}
\subsubsection{Why Buddy is Insufficient for Real-Time}
While the Buddy System is efficient ($O(\log N)$) and handles external fragmentation well via coalescing, it suffers from significant \textbf{Internal Fragmentation}. Because every request must be rounded up to a power of two, a request for 33 KiB requires a 64 KiB block, wasting roughly 48\% of the allocated memory.
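Both the power-of-two rounding and the buddy computation itself reduce to a few bit operations. The following sketch is illustrative (a 4096-byte minimum block is assumed, matching the page-frame size used earlier; the names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

#define MIN_ORDER 12u  /* smallest block: 2^12 = 4096 bytes, one page frame */

/* Round a request up to the nearest power-of-two order, as the
 * binary buddy algorithm requires. A 33 KiB request yields order 16,
 * i.e. a 64 KiB block. */
static unsigned buddy_order(size_t size) {
    unsigned order = MIN_ORDER;
    while (((size_t)1 << order) < size)
        order++;
    return order;
}

/* A block's buddy differs from it only in the bit selected by the
 * order, so its address is recovered with a single XOR. */
static uintptr_t buddy_of(uintptr_t addr, unsigned order) {
    return addr ^ ((uintptr_t)1 << order);
}
```

The XOR trick is also why coalescing is cheap: on free, the allocator checks whether `buddy_of(addr, order)` is itself free and merges if so.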
-Furthermore, the coalescing logic is restrictive. Two adjacent free blocks of size $2^k$ cannot always be merged. They must be \textit{buddies} (i.e., aligned on a strict $2^{k+1}$ boundary). This can lead to cases where contiguous memory is available but unusable because the blocks are "cousins" rather than "buddies" \cite{wilson1995survey}. These limitations motivate the need for more granular allocators in real-time systems, such as TLSF.
+Furthermore, the coalescing logic is restrictive. Two adjacent free blocks of size $2^k$ cannot always be merged. They must be \textit{buddies} (i.e., aligned on a strict $2^{k+1}$ boundary). This can lead to cases where contiguous memory is available but unusable because the blocks are ``cousins'' rather than ``buddies''~\cite{wilson1995survey}. These limitations motivate the need for more granular allocators in real-time systems, such as TLSF.
\subsection{Mechanism: Two-Level Segregated Fit (TLSF)}
While the Buddy System is fast, its reliance on power-of-two sizing creates unacceptable internal fragmentation for workloads that frequently allocate objects of irregular sizes (e.g., 50 bytes or 800 bytes). Additionally, real-time systems require a strict guarantee of constant-time $O(1)$ performance, regardless of the fragmentation state of the heap.
-The \textbf{Two-Level Segregated Fit (TLSF)} allocator \cite{masmano2004tlsf} was designed specifically to address these requirements. It is a segregated fit allocator (like Buddy) but with a much finer granularity of size classes and a more flexible coalescing strategy.
+The \textbf{Two-Level Segregated Fit (TLSF)} allocator~\cite{masmano2004tlsf} was designed specifically to address these requirements. It is a segregated fit allocator (like Buddy) but with a much finer granularity of size classes and a more flexible coalescing strategy.
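TLSF's $O(1)$ bound rests on computing the target free list directly from the request size with bit operations, rather than searching. A minimal sketch of the kind of mapping it uses (the subdivision width SL_BITS is illustrative; real implementations additionally special-case small sizes):

```c
#include <stdint.h>

#define SL_BITS 4u  /* illustrative: 2^4 = 16 subdivisions per power-of-two range */

/* Map a request size to a TLSF-style index pair: f selects the
 * power-of-two range, s selects a linear subdivision within it.
 * Assumes size >= (1 << SL_BITS); tiny sizes are handled separately
 * in production implementations. */
static void tlsf_mapping(uint32_t size, uint32_t *f, uint32_t *s) {
    uint32_t fl = 31u - (uint32_t)__builtin_clz(size);      /* floor(log2(size)) */
    *f = fl;
    *s = (size >> (fl - SL_BITS)) & ((1u << SL_BITS) - 1u); /* next SL_BITS bits */
}
```

With 16 subdivisions, an 800-byte request maps into a bin only a few percent wider than the request itself, instead of being rounded up to 1024 bytes as the Buddy System would require.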
\subsubsection{Addressing Internal Fragmentation: The Two-Level Index}
To reduce internal fragmentation, an allocator needs size classes that are closer together than powers of two. However, having thousands of free lists (one for 10 bytes, one for 11, etc.) makes searching for a free block slow. TLSF solves this by organizing free lists into a two-dimensional matrix, denoted by indices $(f, s)$.
@@ -1501,7 +1461,7 @@ \subsection{Advanced Layering: Object Caching (Slab)}
Initializing these objects can be expensive. For example, initializing a kernel mutex involves setting internal flags, initializing wait queues, and perhaps interacting with hardware interrupt controllers. If an object is allocated, initialized, used, destroyed, and freed repeatedly, the CPU spends significant cycles just setting up and tearing down the same state.
-The \textbf{Slab Allocator}, introduced by Jeff Bonwick for SunOS \cite{bonwick1994slab}, solves this by observing that the state of a freed object is often valid for the next allocation. It separates the concepts of \textit{memory release} and \textit{object destruction}.
+The \textbf{Slab Allocator}, introduced by Jeff Bonwick for SunOS~\cite{bonwick1994slab}, solves this by observing that the state of a freed object is often valid for the next allocation. It separates the concepts of \textit{memory release} and \textit{object destruction}.
\subsubsection{The Concept of Object Caching}
A Slab Allocator organizes memory into caches of specific object types (e.g., a cache for \texttt{task\_structs}, a cache for \texttt{inodes}).
@@ -1527,7 +1487,7 @@ \subsection{Scalability: Allocating on Multicore Systems}
\subsubsection{The Hoard Allocator}
A naive implementation of per-CPU heaps suffers from a specific pathology known as \textit{blowup}. Consider a ``Producer-Consumer'' pattern where Thread A (on CPU 1) continuously allocates packets, and Thread B (on CPU 2) consumes and frees them.
CPU 1's heap constantly empties (requesting more memory from the global heap), and CPU 2's heap constantly fills up (never returning memory to the global heap). The system runs out of memory despite having ample free space, because that space is trapped on CPU 2.
-The \textbf{Hoard Allocator} \cite{berger2000hoard} addresses this by organizing memory into \textit{Superblocks} --- large chunks of memory containing multiple objects of the same size. Hoard tracks the "emptiness" of these superblocks. If a superblock on a local heap becomes mostly empty, it is moved to the global heap, allowing other processors to reuse the memory. This guarantees that the memory consumption of the allocator is bounded within a constant factor of the ideal required memory, solving the blowup problem while maintaining scalability.
+The \textbf{Hoard Allocator}~\cite{berger2000hoard} addresses this by organizing memory into \textit{Superblocks} --- large chunks of memory containing multiple objects of the same size. Hoard tracks the ``emptiness'' of these superblocks. If a superblock on a local heap becomes mostly empty, it is moved to the global heap, allowing other processors to reuse the memory. This guarantees that the memory consumption of the allocator is bounded within a constant factor of the ideal required memory, solving the blowup problem while maintaining scalability.
\subsection{Hardware Constraints: Zoning}
@@ -1546,9 +1506,9 @@ \section{Virtual Memory Management}
\subsection{The Abstraction of Memory}
\label{subsec:vmm_abstraction}
-While the Physical Memory Manager (discussed in Section \ref{subsubsec:physical_memory_management}) is responsible for the accounting of raw storage resources, it provides no mechanisms for isolation, safety, or convenient addressing. In a system lacking a Memory Management Unit (MMU), software operates in a physical addressing mode often referred to as Identity Mapping.
In such an environment, the address utilized by a machine instruction corresponds directly to the electrical signals asserted on the memory bus. While this model minimizes hardware complexity and is prevalent in embedded microcontrollers, it presents insurmountable challenges for a general-purpose operating system. +While the Physical Memory Manager (discussed in Section~\ref{subsubsec:physical_memory_management}) is responsible for the accounting of raw storage resources, it provides no mechanisms for isolation, safety, or convenient addressing. In a system lacking a Memory Management Unit (MMU), software operates in a physical addressing mode often referred to as Identity Mapping. In such an environment, the address utilized by a machine instruction corresponds directly to the electrical signals asserted on the memory bus. While this model minimizes hardware complexity and is prevalent in embedded microcontrollers, it presents insurmountable challenges for a general-purpose operating system. -In a physical addressing model, every application must share a single global address space. This tight coupling implies that a programming error in one task --- such as a buffer overflow or a wild pointer dereference --- can corrupt the data structures of another task or even the kernel itself, leading to immediate system instability. Furthermore, compiling software becomes arduous, as every program requires a unique load address to avoid collision, rendering the execution of multiple instances of the same program nearly impossible without complex, position-independent code generation. +In a physical addressing model, every application must share a single global address space. This tight coupling implies that a programming error in one task --- such as a buffer overflow or a wild pointer dereference --- can corrupt the data structures of another task or even the kernel itself, leading to immediate system instability. 
Furthermore, compiling software becomes arduous, as every program requires a unique load address to avoid collisions, rendering the execution of multiple instances of the same program nearly impossible without complex, position-independent code generation. To resolve these architectural limitations, modern general-purpose operating systems implement Virtual Memory (VM). Fundamentally, Virtual Memory is an abstraction layer that decouples the \textit{logical view} of memory from the \textit{physical reality} of the hardware. @@ -1556,7 +1516,7 @@ \subsection{The Abstraction of Memory} The primary focus in kernel design remains on the protection and organization capabilities of the architecture. The VMM is essential for: \begin{itemize} - \item \textbf{Isolation:} Enforcing privilege boundaries so that user-space applications cannot modify kernel memory or interfere with each other. + \item \textbf{Isolation:} Enforcing privilege boundaries so that user space applications cannot modify kernel memory or interfere with each other. \item \textbf{Flexibility:} Allowing programs to be linked to fixed virtual addresses regardless of their actual physical location or the amount of installed RAM. \item \textbf{Efficiency:} Enabling advanced mechanisms such as demand paging and Copy-on-Write (CoW), which optimize physical RAM usage by sharing data until modification is strictly necessary. \end{itemize} @@ -1747,7 +1707,7 @@ \subsubsection{The Cost of Translation: The TLB} For the operating system designer, the TLB represents a critical resource constraint. The TLB is finite and relatively small (often holding only a few thousand entries). If the working set of a process effectively accesses more pages than can fit in the TLB, the system enters a state of \textbf{TLB Thrashing}, where the CPU spends more cycles walking page tables than executing instructions. -This constraint motivates the support for \textbf{Huge Pages} (or Superpages). As analyzed by Navarro et al. 
\cite{navarro2002practical}, the concept of \textbf{TLB Reach} --- the total amount of memory accessible without incurring a TLB miss --- is a primary determinant of performance for memory-intensive applications. By mapping memory using 2 MiB or 1 GiB pages instead of 4 KiB pages, a single TLB entry can cover a significantly larger region of memory, reducing the frequency of misses. While implementing superpage support adds complexity to the Physical Memory Manager (requiring the allocation of physically contiguous blocks), the theoretical performance gains make understanding this hardware reality essential.
+This constraint motivates the support for \textbf{Huge Pages} (or Superpages). As analyzed by Navarro et al.~\cite{navarro2002practical}, the concept of \textbf{TLB Reach} --- the total amount of memory accessible without incurring a TLB miss --- is a primary determinant of performance for memory-intensive applications. By mapping memory using 2 MiB or 1 GiB pages instead of 4 KiB pages, a single TLB entry can cover a significantly larger region of memory, reducing the frequency of misses. While implementing superpage support adds complexity to the Physical Memory Manager (requiring the allocation of physically contiguous blocks), the theoretical performance gains make understanding this hardware reality essential.
\subsection{Software Architecture: Separation of Concerns}
\label{subsec:vmm_architecture}
@@ -1756,7 +1716,7 @@ \subsection{Software Architecture: Separation of Concerns}
Hardware page tables are designed for the consumption of the MMU, not the operating system developer. They are architecture-specific, rigid in format, and lossy in nature. For example, a standard x86-64 page table entry contains bits for ``Present'', ``Read/Write'', and ``User/Supervisor'', but it has no fields to store high-level concepts such as ``this page belongs to the file \texttt{libc.so}'' or ``this page is a guard page for a thread stack''.
Furthermore, querying the state of the address space using only page tables --- such as finding a contiguous free region of virtual memory --- requires walking a sparse, multi-level tree, which is algorithmically inefficient ($O(N)$ scanning of a potentially massive structure).
-To resolve these issues and ensure portability, a robust Virtual Memory Manager (VMM) should adopt the architectural split pioneered by the Mach microkernel \cite{rashid1987machine}. This design separates the memory subsystem into two distinct layers: the Machine Independent (MI) layer and the Machine Dependent (MD) layer.
+To resolve these issues and ensure portability, a robust Virtual Memory Manager (VMM) should adopt the architectural split pioneered by the Mach microkernel~\cite{rashid1987machine}. This design separates the memory subsystem into two distinct layers: the Machine Independent (MI) layer and the Machine Dependent (MD) layer.
\subsubsection{The Machine Independent Layer (VM Maps)}
@@ -1820,7 +1780,7 @@ \subsubsection{The Machine Dependent Layer (Physical Map)}
\subsection{Data Structures: Managing the Address Space}
\label{subsec:vmm_data_structures}
-Once the Virtual Memory Manager is architecturally separated into machine-independent regions (as described in Section \ref{subsec:vmm_architecture}), the operating system faces a classic data structure problem. A process's address space is essentially a collection of non-overlapping regions, separated by unmapped "holes."
+Once the Virtual Memory Manager is architecturally separated into machine-independent regions (as described in Section~\ref{subsec:vmm_architecture}), the operating system faces a classic data structure problem. A process's address space is essentially a collection of non-overlapping regions, separated by unmapped ``holes''.
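A hedged sketch of such a machine-independent region descriptor follows; the field names are hypothetical, and a sorted singly linked list stands in for the balanced trees that production kernels use:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative machine-independent region descriptor: each entry covers
 * the half-open range [start, end) and carries high-level metadata the
 * hardware page tables cannot express. */
struct vm_region {
    uintptr_t start;
    uintptr_t end;
    unsigned  prot;          /* e.g. read/write/execute permission bits */
    struct vm_region *next;  /* list kept sorted by start address */
};

/* Find the region containing addr, or NULL if it falls in a hole.
 * A linked list makes this O(n); the sorted order lets the scan stop
 * as soon as a region starts beyond addr. */
static struct vm_region *vm_find(struct vm_region *head, uintptr_t addr) {
    for (struct vm_region *r = head; r != NULL && r->start <= addr; r = r->next)
        if (addr < r->end)
            return r;
    return NULL;
}
```

Answering a page fault then amounts to a `vm_find` on the faulting address: a hit yields the metadata needed to resolve the fault, a miss (a hole) signals an invalid access.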
The kernel must perform two frequent operations on this collection: \begin{enumerate} @@ -2032,17 +1992,17 @@ \subsubsection{Swapping} \section{Discovering External Devices and System Capabilities} \label{sec:discovery} -Upon the successful initialization of the fully functional memory management subsystem, the kernel can proceed to discover external devices and implement subsystems for capabilities such as storage, input, and networking. It is not necessary to support an exhaustive list of peripherals to achieve a minimal working kernel. Therefore, this section covers only the absolute minimum set of devices required to support the scheduler and execute user-space programs. +Upon the successful initialization of the fully functional memory management subsystem, the kernel can proceed to discover external devices and implement subsystems for capabilities such as storage, input, and networking. It is not necessary to support an exhaustive list of peripherals to achieve a minimal working kernel. Therefore, this section covers only the absolute minimum set of devices required to support the scheduler and execute user space programs. \subsection{Discovery and Abstraction (Drivers)} Device discovery strategies vary fundamentally between hardware architectures, necessitating platform-specific implementations that feed into a unified kernel abstraction. -On desktop and server platforms, particularly \textbf{x86-64}, the hardware is designed to be self-describing. The \textbf{ACPI} subsystem \cite{ACPI-spec} is utilized to enumerate core platform components (such as CPU cores, LAPICs, and IOAPICs) and handle power management. For peripheral devices, the \textbf{PCI/PCIe} bus \cite{osdev-pci} provides a dynamic enumeration mechanism. The kernel can probe the PCI bus, read vendor and device IDs from configuration space, and automatically load the appropriate drivers without prior knowledge of the hardware topology. 
+On desktop and server platforms, particularly \textbf{x86-64}, the hardware is designed to be self-describing. The \textbf{ACPI} subsystem~\cite{ACPI-spec} is utilized to enumerate core platform components (such as CPU cores, LAPICs, and IOAPICs) and handle power management. For peripheral devices, the \textbf{PCI/PCIe} bus~\cite{osdev-pci} provides a dynamic enumeration mechanism. The kernel can probe the PCI bus, read vendor and device IDs from configuration space, and automatically load the appropriate drivers without prior knowledge of the hardware topology. In contrast, embedded architectures such as \textbf{ARM64} and \textbf{RISC-V}, particularly in System-on-Chip (SoC) implementations, often rely on simple system buses (e.g., AXI, AHB) that lack discovery capabilities. A UART controller or a Timer on an SoC is simply mapped to a fixed physical address. The kernel cannot "ask" the hardware if a device exists at a specific address. Attempting to read from an unmapped address could result in a bus error or system hang. -To address this, these architectures utilize the \textbf{Device Tree} (DTB) mechanism \cite{Devicetree-Spec}. Instead of hardcoding memory addresses into the kernel source (which would require a unique kernel binary for every specific motherboard), the bootloader passes a binary data structure to the kernel at boot. This blob describes the hardware topology, including memory-mapped addresses, interrupt lines, and dependency relationships. The kernel parses this tree to "discover" devices that are physically present but electronically silent. +To address this, these architectures utilize the \textbf{Device Tree} (DTB) mechanism~\cite{Devicetree-Spec}. Instead of hardcoding memory addresses into the kernel source (which would require a unique kernel binary for every specific motherboard), the bootloader passes a binary data structure to the kernel at boot. 
This blob describes the hardware topology, including memory-mapped addresses, interrupt lines, and dependency relationships. The kernel parses this tree to "discover" devices that are physically present but electronically silent. Regardless of whether a device is found via PCI enumeration, ACPI tables, or Device Tree parsing, the architecture-independent kernel must provide a unified interface for device registration. This ensures that the upper layers of the operating system remain agnostic to how the hardware was detected. An example driver interface, designed to abstract these differences, is presented below: @@ -2074,27 +2034,27 @@ \subsection{Core Local Storage} Fundamentally, Core Local Storage (CLS) enables the kernel to maintain state unique to each processing unit, such as the currently executing thread, scheduler queues, and performance counters, without the contention and latency costs associated with global synchronization primitives. To achieve this efficiently, the hardware architecture must provide a low-latency mechanism to retrieve the address of these structures. While implementation details vary, most modern architectures dedicate a specific register for this purpose. For instance, ARM64 utilizes the specialized \texttt{TPIDR\_EL1}\cite{AArch64-cls} system register, whereas RISC-V reserves the \texttt{tp}\cite{RISCV-registers} (Thread Pointer) general-purpose register by convention. -On the x86-64 architecture, this is typically implemented using the \texttt{GS} or \texttt{FS} segment registers. Since the full 64-bit base address cannot be loaded into these segment selectors directly, the kernel must utilize Model Specific Registers (specifically \texttt{IA32\_GS\_BASE} or \texttt{IA32\_KERNEL\_GS\_BASE} \cite{Intel-segment-loading}) to map the per-core structure to the logical address space. 
The \texttt{SWAPGS}\cite{osdev-gs} instruction is then employed during interrupt entry and exit to switch between user-space and kernel-space thread-local storage transparently.
+On the x86-64 architecture, this is typically implemented using the \texttt{GS} or \texttt{FS} segment registers. Since the full 64-bit base address cannot be loaded into these segment selectors directly, the kernel must utilize Model Specific Registers (specifically \texttt{IA32\_GS\_BASE} or \texttt{IA32\_KERNEL\_GS\_BASE}~\cite{Intel-segment-loading}) to map the per-core structure to the logical address space. The \texttt{SWAPGS}~\cite{osdev-gs} instruction is then employed during interrupt entry and exit to switch between user space and kernel space thread-local storage transparently.

\subsection{Core Identification}

-A critical design challenge in this domain is the identification of the executing core. While mostly all processors provide unique hardware identifiers (e.g., APIC ID on x86 \cite{Intel-proc-ids} or MPIDR on ARM \cite{ARM-MPIDR-EL1}), these values are intrinsically tied to the physical topology of the silicon. Consequently, they are often non-contiguous, sparse, or hierarchical, rendering them unsuitable for directly indexing internal kernel arrays. To address this, an operating system typically implements a mapping layer during hardware discovery, assigning a dense, sequential \textbf{Logical ID} (ranging from 0 to $N-1$) to each active core. Generally speaking, another abstraction is handy here.
+A critical design challenge in this domain is the identification of the executing core. While virtually all processors provide unique hardware identifiers (e.g., APIC ID on x86~\cite{Intel-proc-ids} or MPIDR on ARM~\cite{ARM-MPIDR-EL1}), these values are intrinsically tied to the physical topology of the silicon. Consequently, they are often non-contiguous, sparse, or hierarchical, rendering them unsuitable for directly indexing internal kernel arrays.
To address this, an operating system typically implements a mapping layer during hardware discovery, assigning a dense, sequential \textbf{Logical ID} (ranging from 0 to $N-1$) to each active core. In practice, this mapping is best hidden behind a dedicated abstraction.

\subsection{Timing Design}

-To support the preemptive scheduling model, the kernel requires a robust timing infrastructure to track execution time. Before discussing timing devices, a fundamental design decision must be made regarding the kernel type: \texttt{tickless} or \texttt{ticking}. Ticking kernels rely on a global constant defining the frequency of periodic timing interrupts. These values typically operate with millisecond granularity, as higher frequencies would degrade system performance. In contrast, tickless kernels do not define periodic interrupts. Instead, they schedule the next timing event based on immediate requirements, such as the sleep schedule or the time slice of the currently running process. It is important to note that ticking kernels also require higher granularity interrupts if support for \texttt{nanosleep} is needed (e.g., Linux implementation: \ref{subsec:linux-timing}).
+To support the preemptive scheduling model, the kernel requires a robust timing infrastructure to track execution time. Before discussing timing devices, a fundamental design decision must be made regarding the kernel type: \texttt{tickless} or \texttt{ticking}. Ticking kernels rely on a global constant defining the frequency of periodic timing interrupts. These values typically operate with millisecond granularity, as higher frequencies would degrade system performance. In contrast, tickless kernels do not define periodic interrupts. Instead, they schedule the next timing event based on immediate requirements, such as the sleep schedule or the time slice of the currently running process.
It is important to note that ticking kernels also require higher-granularity interrupts if support for \texttt{nanosleep} is needed (see the Linux implementation in Section~\ref{subsec:linux-timing}).

\subsection{Clocks}
\label{subsec:clocks-tutorial}

-The first category of devices required is a simple clock capable of tracking the time elapsed since system startup, primarily for gathering statistics and maintaining timekeeping. These are essential for placing tasks on a timeline and scheduling interrupts for high-precision sleep operations. Without such capabilities, waking tasks would require iterating over all sleeping entities to decrement remaining time, resulting in $O(n)$ complexity. Since interrupt handlers must execute rapidly, tasks should be managed in a high-performance timeline-based priority queue. On x86-64, the TSC clock \cite{IntelManual-TSC} is suitable for measuring system uptime (sufficient for the scheduler), while the RTC is used for wall-clock time displayed to the user. If multiple timers are available, the kernel should select the one offering the best performance-accuracy ratio, obviously from the set of supported ones.
+The first category of devices required is a simple clock capable of tracking the time elapsed since system startup, primarily for gathering statistics and maintaining timekeeping. These are essential for placing tasks on a timeline and scheduling interrupts for high-precision sleep operations. Without such capabilities, waking tasks would require iterating over all sleeping entities to decrement remaining time, resulting in $O(n)$ complexity. Since interrupt handlers must execute rapidly, tasks should be managed in a high-performance timeline-based priority queue. On x86-64, the TSC clock~\cite{IntelManual-TSC} is suitable for measuring system uptime (sufficient for the scheduler), while the RTC is used for wall-clock time displayed to the user.
If multiple timers are available, the kernel should select the one offering the best performance-accuracy ratio from among those it supports.

\subsection{Event Clocks}
\label{subsec:event_clocks}

-At this stage, the approach to timing event logic must be established, depending on the target architecture and the capabilities exposed to users. Crucially, to implement preemption as discussed in Section \ref{subsec:scheduling_approaches}, these devices are required to forcibly interrupt the currently running task when its time slice expires.
+At this stage, the approach to timing event logic must be established, depending on the target architecture and the capabilities exposed to users. Crucially, to implement preemption as discussed in Section~\ref{subsec:scheduling_approaches}, these devices are required to forcibly interrupt the currently running task when its time slice expires.

-The first consideration is sleep granularity. A standard precision of approximately one millisecond is achievable on nearly every architecture using simple periodic interrupts. However, implementing high-precision nanosecond or microsecond sleep requires timer devices capable of extremely rapid reconfiguration. The minimum time interval between consecutive events is strictly limited by the hardware configuration latency. The kernel cannot schedule an interrupt to occur sooner than the time required to program the device. Older timing hardware often requires slow I/O port interactions, rendering them unsuitable for high-precision scheduling, whereas modern devices utilize faster interfaces. Additionally, it is common for a single hardware component to serve a dual purpose, acting as both a timekeeping source (Clock \ref{subsec:clocks-tutorial}) and an interrupt generator (Event Clock).
+The first consideration is sleep granularity. A standard precision of approximately one millisecond is achievable on nearly every architecture using simple periodic interrupts.
However, implementing high-precision nanosecond or microsecond sleep requires timer devices capable of extremely rapid reconfiguration. The minimum time interval between consecutive events is strictly limited by the hardware configuration latency. The kernel cannot schedule an interrupt to occur sooner than the time required to program the device. Older timing hardware often requires slow I/O port interactions, rendering them unsuitable for high-precision scheduling, whereas modern devices utilize faster interfaces. Additionally, it is common for a single hardware component to serve a dual purpose, acting as both a timekeeping source (Clock~\ref{subsec:clocks-tutorial}) and an interrupt generator (Event Clock).

Another critical factor is support for multiple cores. In an SMP environment, it is preferable to utilize timing devices that reside locally on the cores and operate independently, thereby maximizing performance. While most modern architectures provide such devices (e.g., LAPIC Timer on x86-64), some older or embedded SMP platforms may lack them. In such cases, the kernel must utilize a global timer and employ interrupt-driven inter-core communication to simulate local interrupts.

@@ -2108,11 +2068,11 @@ \subsection{Event Clocks}

\item SMP non-local periodic-only timer $\rightarrow$ the core owning the timer propagates its interrupts to the target cores; only low-precision msleep is available.
\end{itemize}

-Most modern architectures support both local and high-precision timers. For example, on x86-64 with SMP, the presence of local LAPIC Timers \cite{IntelManual-APIC} (high precision, one-shot) is guaranteed, significantly simplifying the implementation.
+Most modern architectures support both local and high-precision timers. For example, on x86-64 with SMP, the presence of local LAPIC Timers~\cite{IntelManual-APIC} (high precision, one-shot) is guaranteed, significantly simplifying the implementation.
\subsection{Keyboard}

-Lastly, we need some way to provide input to the kernel on the real machine, not only emulated environment. For that, we will need some keyboard handling. First easier way is to utilize some simple ports like PS/2 \cite{osdev-ps21, osdev-ps22} if they are available on our device (it is quite popular on embedded systems). Otherwise, we will probably have to implement some USB drivers, which will be much more difficult \cite{usb20-spec, osdev-usb}.
+Lastly, the kernel needs a way to receive input on real hardware, not only in an emulated environment, which requires keyboard handling. The simplest option is a legacy interface such as PS/2~\cite{osdev-ps21, osdev-ps22}, if it is available on the target device (it remains common on embedded systems). Otherwise, USB drivers must be implemented, which is considerably more difficult~\cite{usb20-spec, osdev-usb}.

\section{File Systems}
\label{sec:theory_fs}

@@ -2123,7 +2083,7 @@ \section{File Systems}

\subsection{Virtual File System (VFS)}

-A non-trivial kernel must support multiple file system formats (e.g., FAT32 \cite{fat-spec} for UEFI compatibility, Ext2 \cite{ext2-spec} for Linux compatibility, or ISO9660 \cite{iso9660-spec} for optical media). Implementing system calls like \texttt{open()} or \texttt{read()} separately for each format would lead to a tight coupling between the kernel core and specific driver implementations.
+A non-trivial kernel must support multiple file system formats (e.g., FAT32~\cite{fat-spec} for UEFI compatibility, Ext2~\cite{ext2-spec} for Linux compatibility, or ISO9660~\cite{iso9660-spec} for optical media). Implementing system calls like \texttt{open()} or \texttt{read()} separately for each format would lead to a tight coupling between the kernel core and specific driver implementations.

To resolve this, modern kernels implement a \textbf{Virtual File System} (VFS).
The VFS is a kernel subsystem that provides a unified abstract interface for file operations. It defines generic structures --- such as file metadata, directory entries, and mounted file systems --- and a set of operations that concrete file system drivers must implement. By using a function pointer table, the VFS decouples the high-level system calls from low-level storage details. @@ -2221,7 +2181,7 @@ \subsection{Software Rendering and Console Emulation} \subsection{The Concurrency Problem} -In a multitasking operating system, the display represents a shared resource. If multiple processes were granted direct write access to the physical framebuffer simultaneously, the screen would become a chaotic collage of uncoordinated pixel updates. Furthermore, direct access to physical memory mapped I/O by user-space applications poses a significant stability and security risk. +In a multitasking operating system, the display represents a shared resource. If multiple processes were granted direct write access to the physical framebuffer simultaneously, the screen would become a chaotic collage of uncoordinated pixel updates. Furthermore, direct access to physical memory mapped I/O by user space applications poses a significant stability and security risk. To address this, the operating system must virtualize the display hardware, revoking direct access from user applications and enforcing a strict ownership model. @@ -2254,7 +2214,7 @@ \subsubsection{Cooperative Scheduling} \end{itemize} \subsubsection{Preemptive Scheduling} -Preemptive scheduling is the standard for modern general-purpose operating systems. In this model, the kernel assigns a specific time slice to each task. If a task does not yield voluntarily before its time expires, the hardware generates an interrupt (driven by an Event Clock \ref{subsec:event_clocks}), returning control to the kernel. The scheduler then forcibly saves the current task's state and switches to another. 
+Preemptive scheduling is the standard for modern general-purpose operating systems. In this model, the kernel assigns a specific time slice to each task. If a task does not yield voluntarily before its time expires, the hardware generates an interrupt (driven by an Event Clock~\ref{subsec:event_clocks}), returning control to the kernel. The scheduler then forcibly saves the current task's state and switches to another. \begin{itemize} \item \textbf{Advantages:} Responsiveness and stability. No single task can monopolize the CPU, ensuring that the system remains responsive even if a user program hangs. @@ -2275,7 +2235,7 @@ \subsection{Process} While additional resources may be required to implement features such as Inter-Process Communication (IPC), pipes, file operations, or synchronization primitives, the two elements listed above constitute the bare minimum for the process concept. -A straightforward implementation involves defining separate structures for \texttt{Thread} and \texttt{Process} objects, although some designs merge these into a single task structure (as discussed in Section \ref{subsec:linux-task}). +A straightforward implementation involves defining separate structures for \texttt{Thread} and \texttt{Process} objects, although some designs merge these into a single task structure (as discussed in Section~\ref{subsec:linux-task}). At this stage, it is prudent to prepare data structures that allow for efficient querying of processes, facilitating operations such as termination, joining, or IPC. @@ -2365,7 +2325,7 @@ \subsubsection{Context Creation} \subsubsection{Context Conversion} -Before a context switch can occur, an initial context must exist. This implies the creation of the first process and the preparation of all associated structures described previously, utilizing the context initialization logic \ref{subsec:context-init}. Several approaches exist to achieve this. 
For example, the kernel may spawn an initial user-space program (init) responsible for booting other programs, or it may create a kernel-space process to perform further system initialization. +Before a context switch can occur, an initial context must exist. This implies the creation of the first process and the preparation of all associated structures described previously, utilizing the context initialization logic~\ref{subsec:context-init}. Several approaches exist to achieve this. For example, the kernel may spawn an initial user space program (init) responsible for booting other programs, or it may create a kernel space process to perform further system initialization. \subsubsection{Context Switch} @@ -2400,12 +2360,12 @@ \subsection{Scheduler Internal Synchronization} \begin{itemize} \item \textbf{Interrupt Synchronization} --- In a preemptive scheduling model, interrupts may induce context switches at arbitrary points. Even in non-preemptive scenarios, interrupt handlers may interact with shared kernel data. Consequently, critical sections must be protected against race conditions caused by asynchronous interrupt execution. -\item \textbf{SMP Synchronization} --- In a multiprocessor environment, concurrent access to shared data by multiple cores necessitates rigorous protection. For short critical sections, spinlocks \cite{spinlock} or memory barriers \cite{barriers} are suitable. However, for complex scenarios requiring extended waits, blocking locks (utilizing wait queues \ref{subsec:wait_queue}) are required. +\item \textbf{SMP Synchronization} --- In a multiprocessor environment, concurrent access to shared data by multiple cores necessitates rigorous protection. For short critical sections, spinlocks~\cite{spinlock} or memory barriers~\cite{barriers} are suitable. However, for complex scenarios requiring extended waits, blocking locks (utilizing wait queues~\ref{subsec:wait_queue}) are required. 
\end{itemize} \subsection{Scheduler} -With thread and process management structures established, the final component required is the mechanism responsible for selecting the next task for execution during preemption or voluntary yielding. The primary responsibility of the scheduler is to track the execution state of threads and determine the optimal task to run according to a specific policy. The complexity of the implementation is largely dictated by the chosen scheduling paradigm (Section \ref{subsec:scheduling_approaches}). +With thread and process management structures established, the final component required is the mechanism responsible for selecting the next task for execution during preemption or voluntary yielding. The primary responsibility of the scheduler is to track the execution state of threads and determine the optimal task to run according to a specific policy. The complexity of the implementation is largely dictated by the chosen scheduling paradigm (Section~\ref{subsec:scheduling_approaches}). The simplest approach is a single-core, non-preemptive (cooperative) scheduler. In this model, synchronization concerns within the scheduler are minimal, as context switches occur only at explicit yield points. Furthermore, the sleeping mechanism can be simplified, as tasks are not preempted, allowing wake-up checks to occur solely during yield calls without strict timing reliance. @@ -2425,7 +2385,7 @@ \subsection{Scheduler} No single scheduler can be ideal in all metrics. Therefore, a balanced solution suitable for the system's specific purpose is required. -Furthermore, task prioritization is essential. For instance, audio or block device drivers require immediate execution to ensure data consistency and user experience. To address this, hybrid policies (such as \texttt{SCHED\_FIFO} in Linux \ref{subsec:linux-scheduler}) allow critical tasks to operate in a cooperative manner, yielding only when necessary, while other tasks remain preemptible. 
+Furthermore, task prioritization is essential. For instance, audio or block device drivers require immediate execution to ensure data consistency and user experience. To address this, hybrid policies (such as \texttt{SCHED\_FIFO} in Linux~\ref{subsec:linux-scheduler}) allow critical tasks to operate in a cooperative manner, yielding only when necessary, while other tasks remain preemptible. To illustrate these concepts, we present the simplest form of a preemptive scheduler: a periodic, single-core Round Robin scheduler. This algorithm maintains a single queue of executable tasks. New tasks are appended to the tail of the queue, while the scheduler selects the task at the head for execution. At fixed intervals (e.g., every 10ms), the running task is interrupted, placed back at the tail of the queue, and the next task is dequeued. The following pseudocode demonstrates this logic: @@ -2471,7 +2431,7 @@ \subsection{Blocking - Wait Queues} Regardless of the scheduling algorithm employed, a mechanism is required to block threads and defer their execution until specific events occur, such as acquiring a lock, joining a thread, or awaiting I/O. This necessitates the management of multiple queues distributed throughout the kernel codebase, in contrast to the centralized run queues of the scheduler. -To distinguish between runnable and waiting threads, a specific \texttt{BLOCKED} state is assigned to tasks residing in a wait queue. This distinction simplifies resource management and structure cleanup. Intrusive data structures (Section \ref{subsec:intrusive}) are particularly advantageous in this context due to their efficient removal properties. The entity owning the wait queue is responsible for waking threads (transitioning them from the wait queue back to the scheduler's run queue) when the event occurs. Conversely, the task itself initiates the blocking process when it reaches a synchronization point. 
To distinguish between runnable and waiting threads, a specific \texttt{BLOCKED} state is assigned to tasks residing in a wait queue. This distinction simplifies resource management and structure cleanup. Intrusive data structures (Section~\ref{subsec:intrusive}) are particularly advantageous in this context due to their efficient removal properties. The entity owning the wait queue is responsible for waking threads (transitioning them from the wait queue back to the scheduler's run queue) when the event occurs. Conversely, the task itself initiates the blocking process when it reaches a synchronization point.

\subsection{Sleep}

@@ -2479,13 +2439,13 @@ \subsection{Sleep}

In a cooperative scheduling model, the wake-up logic can simply be integrated into the function responsible for selecting the next task. However, in preemptive and high-precision scenarios, the system must actively manage the timing framework to wake threads without introducing unnecessary latency.

-For kernels utilizing a periodic tick, handling sleep operations with granularity aligned to the tick interval is straightforward. Waking up tasks can be performed on every timer interrupt. We just have to count how many times the interrupt fired (how many ticks have passed). Efficient algorithms, such as the timer wheel \cite{linux_timer_wheel}, exist to manage this type of coarse-grained sleeping.
+For kernels utilizing a periodic tick, handling sleep operations with granularity aligned to the tick interval is straightforward. Waking up tasks can be performed on every timer interrupt; the kernel merely counts how many times the interrupt has fired, i.e., how many ticks have elapsed. Efficient algorithms, such as the timer wheel~\cite{linux_timer_wheel}, exist to manage this type of coarse-grained sleeping.

In contrast, for a tickless kernel where interrupts are scheduled dynamically, the implementation is more intricate.
This approach requires maintaining an execution timeline, typically implemented using a priority queue to sort tasks by their wake-up timestamps. For high-precision sleep (e.g., \texttt{nanosleep}), hardware constraints become a critical factor. If the requested sleep duration is smaller than the overhead required to interact with the timing module or reconfigure the hardware timer, busy-waiting may be a more efficient alternative. Designing such a framework requires careful consideration of timing constraints, such as the time consumed by timer reconfiguration and ensuring that the calculated wake-up time has not already passed during the configuration process. Additionally, whenever the timer is reprogrammed, the system must ensure that any tasks scheduled to wake up before the new interrupt time are correctly handled. \section{User Space} -With the foundational kernel components established, the development process transitions to enabling the execution of unprivileged tasks created by the user. To facilitate the execution of user-space programs, several key mechanisms and components must be implemented: +With the foundational kernel components established, the development process transitions to enabling the execution of unprivileged tasks created by the user. To facilitate the execution of user space programs, several key mechanisms and components must be implemented: \begin{itemize} \item \textbf{System Calls} --- A mechanism for communication with the kernel, providing a controlled method for privilege escalation. @@ -2499,7 +2459,7 @@ \section{User Space} \subsection{Syscalls} \label{subsubsec:syscalls} -System calls (syscalls) are the primary interface through which user space applications request services from the operating system kernel. Modern processors enforce a strict separation between unprivileged applications and the operating system kernel to ensure stability and security. 
This privilege separation is architecturally defined: x86-64 uses Protection Rings (Ring 3 for users, Ring 0 for kernel) \cite{IntelManual-Protection-Rings}, ARM64 uses Exception Levels (EL0 for users, EL1 for kernel) \cite{ARM-Arch-Ref-Exception-Levels}, and RISC-V uses Privilege Modes (User Mode and Supervisor Mode) \cite{RISCV-Priv-Spec-Privilege-Levels}. +System calls (syscalls) are the primary interface through which user space applications request services from the operating system kernel. Modern processors enforce a strict separation between unprivileged applications and the operating system kernel to ensure stability and security. This privilege separation is architecturally defined: x86-64 uses Protection Rings (Ring 3 for users, Ring 0 for kernel)~\cite{IntelManual-Protection-Rings}, ARM64 uses Exception Levels (EL0 for users, EL1 for kernel)~\cite{ARM-Arch-Ref-Exception-Levels}, and RISC-V uses Privilege Modes (User Mode and Supervisor Mode)~\cite{RISCV-Priv-Spec-Privilege-Levels}. A system call is a synchronous transition wherein an application explicitly requests a service from the kernel. This process involves privilege escalation, switching stacks, performing the operation, and safely returning to the unprivileged state. @@ -2519,15 +2479,15 @@ \subsection{Syscalls} There are two primary methods for implementing system calls, distinguished by the balance between simplicity and performance. \begin{enumerate} - \item \textbf{Software Interrupts:} Historically, systems used software interrupts (e.g., \texttt{int 0x80} on x86) to trigger syscalls. In this model, the CPU looks up a handler in the Interrupt Descriptor Table (IDT), performs extensive security checks (privilege verification, segment checks), and automatically pushes the execution context onto the kernel stack. + \item \textbf{Software Interrupts:} Historically, systems used software interrupts (e.g., \texttt{int 0x80} on x86) to trigger syscalls. 
In this model, the CPU looks up a handler in the Interrupt Descriptor Table~(IDT), performs extensive security checks (privilege verification, segment checks), and automatically pushes the execution context onto the kernel stack. While robust and easy to debug (as it unifies syscalls with exception handling), this mechanism is slow, typically consuming approximately 100--200 CPU cycles per call depending on the architecture. It is often the recommended starting point for new OS projects due to its implementation simplicity. \item \textbf{Fast System Call Instructions:} Modern architectures provide dedicated instructions optimized for low-latency transitions (approximately 30--50 cycles). These bypass the complex interrupt logic but require the kernel to perform more manual state management. \begin{itemize} \item \textbf{x86-64 (\texttt{syscall}/\texttt{sysret}):} The CPU loads the kernel entry point and code segments from Model-Specific Registers (MSRs). The CPU does not switch the stack pointer automatically. The kernel must explicitly save the user stack and load the kernel stack immediately upon entry. - \item \textbf{ARM64 (\texttt{svc}/\texttt{eret}):} The processor saves the state and return address into specific system registers (SPSR and ELR) \cite{ARM-Arch-Ref-Exception-Levels}. Unlike x86-64, ARM64 hardware automatically switches the stack pointer to the kernel's stack (\texttt{SP\_EL1}). - \item \textbf{RISC-V (\texttt{ecall}/\texttt{sret}):} Following a minimalist philosophy, the hardware does little more than jump to the trap vector and switch privilege modes \cite{RISCV-Priv-Spec-System-Call}. Specifically, the hardware automatically saves the return address to \texttt{mepc} (or \texttt{sepc}), records the exception cause in \texttt{mcause} (or \texttt{scause}), updates the privilege bits in \texttt{mstatus} (or \texttt{sstatus}), and disables interrupts. 
However, the kernel is responsible for saving all general-purpose registers and manually swapping the stack pointer, typically using a scratch register. + \item \textbf{ARM64 (\texttt{svc}/\texttt{eret}):} The processor saves the state and return address into specific system registers (SPSR and ELR)~\cite{ARM-Arch-Ref-Exception-Levels}. Unlike x86-64, ARM64 hardware automatically switches the stack pointer to the kernel's stack (\texttt{SP\_EL1}). + \item \textbf{RISC-V (\texttt{ecall}/\texttt{sret}):} Following a minimalist philosophy, the hardware does little more than jump to the trap vector and switch privilege modes~\cite{RISCV-Priv-Spec-System-Call}. Specifically, the hardware automatically saves the return address to \texttt{mepc} (or \texttt{sepc}), records the exception cause in \texttt{mcause} (or \texttt{scause}), updates the privilege bits in \texttt{mstatus} (or \texttt{sstatus}), and disables interrupts. However, the kernel is responsible for saving all general-purpose registers and manually swapping the stack pointer, typically using a scratch register. \end{itemize} \end{enumerate} @@ -2548,13 +2508,13 @@ \subsection{Syscalls} The syscall interface is the primary attack surface of the kernel. The kernel must treat all data from user space as untrusted. \begin{itemize} - \item \textbf{Pointer Validation:} The kernel must ensure that pointers passed by the user actually point to valid user space memory. Failing to perform this check allows a malicious program to trick the kernel into reading or overwriting its own internal structures (see the "Confused Deputy" problem \cite{confused_deputy_problem}). - \item \textbf{TOCTOU (Time-of-Check to Time-of-Use):} If the kernel validates data in user memory and reads it later, a separate thread could modify that data in between. For example, a kernel might check that a filename pointer is valid, but before it reads the actual filename, another thread could modify that memory to point to a privileged file. 
To prevent this, the kernel must copy data into kernel-space buffers \textit{before} validation and processing.
+ \item \textbf{Pointer Validation:} The kernel must ensure that pointers passed by the user actually point to valid user space memory. Failing to perform this check allows a malicious program to trick the kernel into reading or overwriting its own internal structures (see the ``Confused Deputy'' problem~\cite{confused_deputy_problem}).
+ \item \textbf{TOCTOU (Time-of-Check to Time-of-Use):} If the kernel validates data in user memory and reads it later, a separate thread could modify that data in between. For example, a kernel might check that a filename pointer is valid, but before it reads the actual filename, another thread could modify that memory to point to a privileged file. To prevent this, the kernel must copy data into kernel space buffers \textit{before} validation and processing.
\end{itemize}
\paragraph{Performance Optimization}
-For high-frequency operations where even the minimal syscall overhead is too high (e.g., querying the system time), one may employ \textbf{vDSO} (virtual Dynamic Shared Object). This mechanism maps a read-only page of kernel memory containing specific data and code into the user process. The application can then execute a standard function call to read this data without ever triggering a context switch or entering kernel mode (e.g., linux \cite{linux_vdso}).
+For high-frequency operations where even the minimal syscall overhead is too high (e.g., querying the system time), one may employ \textbf{vDSO} (virtual Dynamic Shared Object). This mechanism maps a read-only page of kernel memory containing specific data and code into the user process. The application can then execute a standard function call to read this data without ever triggering a context switch or entering kernel mode (e.g., Linux~\cite{linux_vdso}).
\subsection{Libc System Headers} \label{subsec:libc_headers} @@ -2564,12 +2524,12 @@ \subsection{Libc System Headers} In the context of custom operating system development, system headers are commonly structured into three distinct categories in order to ensure clarity, portability, and standards compliance. \paragraph{Standard ISO C Headers} -To enable existing software to compile and run on a new operating system, the \texttt{libc} implementation must provide all headers required by the ISO C standard \cite{iso_c_std}. These headers must strictly conform to the standardized function signatures and data types. +To enable existing software to compile and run on a new operating system, the \texttt{libc} implementation must provide all headers required by the ISO C standard~\cite{iso_c_std}. These headers must strictly conform to the standardized function signatures and data types. Internally, such functions act as abstraction layers over the kernel interface. The implementation of \texttt{fopen} typically allocates a user space data structure to represent the file stream and subsequently invokes the kernel's syscall to obtain a file descriptor or handle. This design effectively decouples application-level code from kernel-specific details such as syscall numbers, calling conventions, and ABI constraints. \paragraph{POSIX and System Headers} -In addition to the ISO C standard, many operating systems implement parts of the POSIX specification \cite{posix_std} to improve compatibility with existing UNIX-like software. Headers associated with POSIX functionality are commonly placed within the \texttt{sys/} directory (e.g., \texttt{} or \texttt{}). +In addition to the ISO C standard, many operating systems implement parts of the POSIX specification~\cite{posix_std} to improve compatibility with existing UNIX-like software. Headers associated with POSIX functionality are commonly placed within the \texttt{sys/} directory (e.g., \texttt{} or \texttt{}). 
\paragraph{OS-Specific Extensions} Custom operating systems frequently introduce features that fall outside the scope of ISO C or POSIX, such as direct frame buffer access, specialized inter-process communication mechanisms, or hardware-specific controls. @@ -2591,7 +2551,7 @@ \subsection{Root File System (Rootfs)} \item \textbf{Overlay:} Integration of static configuration files and resources into the staging area to complement the compiled executables. \item \textbf{File System Image Creation:} Conversion of the fully populated staging directory into a single binary image adhering to a chosen file system format. \end{itemize} -An overview of this pipeline is illustrated in Figure \ref{fig:rootfs_pipeline}. +An overview of this pipeline is illustrated in Figure~\ref{fig:rootfs_pipeline}. \begin{figure}[htbp] \centering @@ -2677,7 +2637,7 @@ \subsubsection{Runtime Initialization} \subsubsection{OS-Specific Toolchain Adaptation} \label{subsubsec:os-specific-toolchain} -While a generic cross-compiler (e.g., targeting a bare-metal environment such as \texttt{x86-64-elf}) is sufficient for compiling the kernel itself, it presents limitations for user space development. Generic toolchains are agnostic to the operating system's file system layout and standard library locations, requiring the build system to manually manage header paths and library linkage for every compilation unit. To streamline this process, the toolchain can be patched to target the custom operating system natively (e.g., via a target like \texttt{x86-64-youros}) \cite{osdev-os-specific-toolchain}. This automates environment detection and dependency resolution, significantly reducing reliance on complex build scripts. From this point onward, one should provide their own toolchain source code (or patches), alongside the OS source code, instead of relying on an upstream one. 
+While a generic cross-compiler (e.g., targeting a bare-metal environment such as \texttt{x86-64-elf}) is sufficient for compiling the kernel itself, it presents limitations for user space development. Generic toolchains are agnostic to the operating system's file system layout and standard library locations, requiring the build system to manually manage header paths and library linkage for every compilation unit. To streamline this process, the toolchain can be patched to target the custom operating system natively (e.g., via a target like \texttt{x86-64-youros})~\cite{osdev-os-specific-toolchain}. This automates environment detection and dependency resolution, significantly reducing reliance on complex build scripts. From this point onward, one should provide their own toolchain source code (or patches), alongside the OS source code, instead of relying on an upstream one. \paragraph{System Root (Sysroot)} @@ -2708,7 +2668,7 @@ \subsection{Intrusive Data Structures} A common solution to this problem is the use of \textbf{intrusive data structures}. In this paradigm, the pointers (such as \texttt{next} and \texttt{prev}) are embedded directly within the data object itself, rather than in an external node. This pattern is widely recognized in the C++ ecosystem and serves as the foundation for industry-standard libraries such as \textbf{Boost.Intrusive}\cite{boost_intrusive}. -By adopting this approach, the container manipulates pointers that already exist within the object structure. A typical implementation of this pattern using C++ templates is shown in Listing \ref{lst:intrusiveNodesExample}. +By adopting this approach, the container manipulates pointers that already exist within the object structure. A typical implementation of this pattern using C++ templates is shown in Listing~\ref{lst:intrusiveNodesExample}. 
\clearpage @@ -2983,7 +2943,7 @@ \chapter{Analysis of Existing Solutions} \section{Linux} \label{sec:linux-research} -The Linux operating system family is one of the most widely used globally, alongside Windows. The Linux kernel relies on a monolithic architecture. According to Tanenbaum, the monolithic approach is by far the most common organization for operating systems. In this model, the entire system runs as a single large executable binary program in kernel mode. This design allows for high efficiency, as any procedure within the kernel can directly call any other, facilitating rapid computation and minimal overhead. However, this lack of restriction leads to a system that can be 'unwieldy and difficult to understand'. A significant drawback of this is that a crash in any single procedure, such as a buggy device driver, can effectively take down the entire operating system \cite{tanenbaum2015modern}. +The Linux operating system family is one of the most widely used globally, alongside Windows. The Linux kernel relies on a monolithic architecture. According to Tanenbaum, the monolithic approach is by far the most common organization for operating systems. In this model, the entire system runs as a single large executable binary program in kernel mode. This design allows for high efficiency, as any procedure within the kernel can directly call any other, facilitating rapid computation and minimal overhead. However, this lack of restriction leads to a system that can be 'unwieldy and difficult to understand'. A significant drawback of this is that a crash in any single procedure, such as a buggy device driver, can effectively take down the entire operating system~\cite{tanenbaum2015modern}. \noindent Advantages: \begin{itemize} @@ -2996,11 +2956,11 @@ \section{Linux} \noindent Disadvantages: \begin{itemize} \item \textbf{Stability} --- A bug in a single driver, even for an unused device, can compromise the stability of the entire system. 
- \item \textbf{Security} --- Vulnerabilities in drivers can potentially expose the kernel address space to malicious user-space execution. + \item \textbf{Security} --- Vulnerabilities in drivers can potentially expose the kernel address space to malicious user space execution. \item \textbf{Scalability challenges} --- For large-scale projects, strict standards are necessary. Without them, the code base can easily become unmaintainable due to complex webs of references and dependencies. \end{itemize} -To enhance extensibility and avoid the need for recompilation when adding drivers, Linux employs Loadable Kernel Modules (LKMs) \cite{linux_lkm}. These allow the kernel functionality to be extended dynamically via well-defined interfaces during runtime. +To enhance extensibility and avoid the need for recompilation when adding drivers, Linux employs Loadable Kernel Modules (LKMs)~\cite{linux_lkm}. These allow the kernel functionality to be extended dynamically via well-defined interfaces during runtime. \begin{figure}[htbp] \centering @@ -3036,14 +2996,14 @@ \section{Linux} \subsection{Timing} \label{subsec:linux-timing} -Historically, Linux utilized a strictly periodic tick (defined by hardware), configuring hardware timers to generate interrupts at a fixed frequency (e.g., 100 Hz or 1000 Hz). While this design was straightforward, it imposed limitations on sleep granularity and power efficiency. To address these issues, Linux introduced High Resolution Timers (\texttt{hrtimers}), which allow the kernel to schedule interrupts with nanosecond precision based on immediate needs rather than a fixed cadence. Despite the capabilities of high-resolution timers, the concept of a periodic tick is maintained within the kernel for performance reasons and architectural legacy, by simulating the tick with high precision framework. 
+Historically, Linux utilized a strictly periodic tick (defined by hardware), configuring hardware timers to generate interrupts at a fixed frequency (e.g., 100 Hz or 1000 Hz). While this design was straightforward, it imposed limitations on sleep granularity and power efficiency. To address these issues, Linux introduced High Resolution Timers (\texttt{hrtimers}), which allow the kernel to schedule interrupts with nanosecond precision based on immediate needs rather than a fixed cadence. Despite the capabilities of high-resolution timers, the concept of a periodic tick is maintained within the kernel for performance reasons and architectural legacy, by simulating the tick with a high-precision framework.
-This hight-precision framework has one major drawback - it can get slow for a large number of events, as it is based on a priority queue with $O(\log n)$complexities. Relying exclusively on high-precision mechanisms for every timing event would introduce unacceptable overhead in scenarios involving a massive number of active timers. To mitigate this, Linux maintains the \texttt{jiffies} counter, which increments at the frequency of the system tick. This counter drives the Timer Wheel mechanism \cite{linux_timer_wheel}, a highly efficient algorithm designed for managing low-precision timeouts. The timer wheel offers $O(1)$ complexity for insertion and expiration, making it ideal for subsystems that require the management of thousands or even hundreds of thousands of concurrent events where nanosecond precision is unnecessary. A prime example is the networking stack, which must track timeouts for tens of thousands of open TCP connections. Utilizing high-resolution timers for such a volume would be computationally impossible. The \texttt{jiffies}-based timer wheel allows the system to handle these massive quantities of events with minimal CPU overhead.
Consequently, standard API calls such as \texttt{msleep} continue to rely on this coarse-grained, high-performance infrastructure.
+This high-precision framework has one major drawback --- it can get slow for a large number of events, as it is based on a priority queue with $O(\log n)$ time complexity. Relying exclusively on high-precision mechanisms for every timing event would introduce unacceptable overhead in scenarios involving a massive number of active timers. To mitigate this, Linux maintains the \texttt{jiffies} counter, which increments at the frequency of the system tick. This counter drives the Timer Wheel mechanism~\cite{linux_timer_wheel}, a highly efficient algorithm designed for managing low-precision timeouts. The timer wheel offers $O(1)$ time complexity for insertion and expiration, making it ideal for subsystems that require the management of thousands or even hundreds of thousands of concurrent events where nanosecond precision is unnecessary. A prime example is the networking stack, which must track timeouts for tens of thousands of open TCP connections. Utilizing high-resolution timers for such a volume would be prohibitively expensive. The \texttt{jiffies}-based timer wheel allows the system to handle these massive quantities of events with minimal CPU overhead. Consequently, standard API calls such as \texttt{msleep} continue to rely on this coarse-grained, high-performance infrastructure.
\subsection{The Process and Thread Model}
\label{subsec:linux-task}
-Unlike many other operating systems that distinguish strictly between processes (containers of resources) and threads (units of execution), Linux treats them almost identically. The core data structure is the \texttt{task\_struct} \cite{linux_task, bovet2005understanding}.
+Unlike many other operating systems that distinguish strictly between processes (containers of resources) and threads (units of execution), Linux treats them almost identically.
The core data structure is the \texttt{task\_struct}~\cite{linux_task, bovet2005understanding}. \begin{itemize} \item A \textbf{Process} is a \texttt{task\_struct} with a unique memory map and file descriptor table. \item A \textbf{Thread} is simply a \texttt{task\_struct} created via the \texttt{clone()} system call with flags such as \texttt{CLONE\_VM} and \texttt{CLONE\_FILES}, causing it to share the address space and resources with its parent. @@ -3052,7 +3012,7 @@ \subsection{The Process and Thread Model} \subsection{Scheduler} \label{subsec:linux-scheduler} -Linux implements several scheduling policies \cite{linux_policies}\cite{linux_cfs} operating on task priorities. There are two major classes: \textbf{Real Time}, which operates on priorities 1-99, and \textbf{Fair}, which operates on priority 0. Tasks with higher priorities preempt those with lower priorities. Linux provides six policy classes: +Linux implements several scheduling policies~\cite{linux_policies, linux_cfs} operating on task priorities. There are two major classes: \textbf{Real Time}, which operates on priorities 1-99, and \textbf{Fair}, which operates on priority 0. Tasks with higher priorities preempt those with lower priorities. Linux provides six policy classes: \begin{itemize} \item \texttt{SCHED\_DEADLINE} --- Takes precedence over any other policy and provides real-time capabilities. @@ -3063,7 +3023,7 @@ \subsection{Scheduler} \item \texttt{SCHED\_IDLE} --- Used for very low priority tasks that run only when the system is otherwise idle. \end{itemize} -From version 2.6.23 \cite{linux_cfs} up to 6.6 \cite{linux_eevdf}, the \texttt{CFS} (Completely Fair Scheduler) was the default scheduler for the \textbf{Fair} class. Starting with version 6.6, the \texttt{EEVDF} (Earliest Eligible Virtual Deadline First) scheduler began replacing CFS. 
+From version 2.6.23~\cite{linux_cfs} up to 6.6~\cite{linux_eevdf}, the \texttt{CFS} (Completely Fair Scheduler) was the default scheduler for the \textbf{Fair} class. Starting with version 6.6, the \texttt{EEVDF} (Earliest Eligible Virtual Deadline First) scheduler began replacing CFS. \subsubsection{Completely Fair Scheduler} The Linux documentation summarizes the design of the \textbf{CFS} as follows: @@ -3081,15 +3041,15 @@ \subsubsection{Completely Fair Scheduler} \subsubsection{Earliest Eligible Virtual Deadline First Scheduler} -This algorithm is becoming the new standard for Linux scheduling, addressing shortcomings of the previous \textbf{CFS} implementation. The new approach functions similarly to CFS but operates on deadlines rather than accumulated runtime \cite{linux_eevdf}. This modification allows for better prioritization of latency-sensitive tasks. +This algorithm is becoming the new standard for Linux scheduling, addressing shortcomings of the previous \textbf{CFS} implementation. The new approach functions similarly to CFS but operates on deadlines rather than accumulated runtime~\cite{linux_eevdf}. This modification allows for better prioritization of latency-sensitive tasks. \subsection{Memory Management} \label{subsec:linux_mem} -Linux features centralized memory management with various APIs, including \texttt{kmalloc}, \texttt{kzalloc}, \texttt{vmalloc}, and \texttt{kvmalloc} \cite{linux_mem_interfaces}. +Linux features centralized memory management with various APIs, including \texttt{kmalloc}, \texttt{kzalloc}, \texttt{vmalloc}, and \texttt{kvmalloc}~\cite{linux_mem_interfaces}. \subsubsection{Physical Memory} -As an architecture-independent kernel, Linux abstracts hardware details. Physical memory is partitioned into zones, each serving a distinct purpose \cite{linux_mem_phys}: +As an architecture-independent kernel, Linux abstracts hardware details. 
Physical memory is partitioned into zones, each serving a distinct purpose~\cite{linux_mem_phys}:
\begin{itemize}
\item \texttt{ZONE\_DMA} --- Memory suitable for DMA (Direct Memory Access) by devices that cannot access the full addressable range.
@@ -3099,11 +3059,11 @@ \subsubsection{Physical Memory}
\item \texttt{ZONE\_HIGHMEM} --- Memory not covered by permanent kernel mappings, used only by some 32-bit architectures.
\end{itemize}
-To enhance multi-core performance, Linux employs a two-step allocation strategy. It utilizes a global buddy allocator alongside Per-CPU Pagesets (PCP). According to Bovet and Cesati, these per-CPU caches are further divided into "hot" and "cold" lists to optimize hardware cache usage. "Hot" pages are assumed to be present in the CPU's L1/L2 cache and are preferred for operations where the CPU will immediately write to the page, whereas "cold" pages are used for DMA operations to avoid invalidating useful cache lines \cite{bovet2005understanding}. Allocations are first attempted from these local caches without locking, falling back to the global allocator only when necessary.
+To enhance multi-core performance, Linux employs a two-step allocation strategy. It utilizes a global buddy allocator alongside Per-CPU Pagesets (PCP). According to Bovet and Cesati, these per-CPU caches are further divided into ``hot'' and ``cold'' lists to optimize hardware cache usage. ``Hot'' pages are assumed to be present in the CPU's L1/L2 cache and are preferred for operations where the CPU will immediately write to the page, whereas ``cold'' pages are used for DMA operations to avoid invalidating useful cache lines~\cite{bovet2005understanding}. Allocations are first attempted from these local caches without locking, falling back to the global allocator only when necessary.
\subsubsection{Virtual Memory}
-Each process possesses its own Page Table, responsible for mapping virtual addresses to physical ones.
Physical frames are acquired from the Physical Memory Manager \ref{subsec:linux_mem}. Virtual address space management for each process utilizes red-black trees to efficiently locate free areas. The kernel address space is mapped into every process. Additionally, Linux implements demand paging, a lazy allocation approach where physical pages are assigned only when the process actually accesses the memory, utilizing the page-fault mechanism.
+Each process possesses its own Page Table, responsible for mapping virtual addresses to physical ones. Physical frames are acquired from the Physical Memory Manager (Section~\ref{subsec:linux_mem}). Virtual address space management for each process utilizes red-black trees to efficiently locate free areas. The kernel address space is mapped into every process. Additionally, Linux implements demand paging, a lazy allocation approach where physical pages are assigned only when the process actually accesses the memory, utilizing the page-fault mechanism.
\subsection{System Interface}
@@ -3136,9 +3096,9 @@ \subsubsection{Fork Mechanism}
\section{Minix 3}
\label{sec:minix-research}
-Minix 3 is a microkernel-based operating system designed with a strong emphasis on high reliability.
Unlike monolithic kernel architectures, which Tanenbaum argues are inherently prone to failure due to high bug density --- statistically between two and ten bugs per thousand lines of code~\cite{tanenbaum2015modern} --- Minix 3 relocates the majority of operating system components into user space. This includes device drivers, file systems, and memory management mechanisms. As a result, the microkernel size is highly reduced, strictly limiting its responsibilities to essential functions such as interrupt handling, process management, and inter-process communication via message passing. -This architectural approach provides significant advantages in terms of fault tolerance. By running each device driver and server as a separate, relatively powerless user process, the kernel ensures that a bug in a single component, such as an audio driver, cannot crash the entire system. A key feature of this design is the "reincarnation server", which monitors system components and automatically replaces those that have failed, making the system effectively "self-healing" \cite{tanenbaum2015modern}. Furthermore, this modularity follows the principle of decoupling mechanism from policy. For example, while the kernel handles the mechanism of switching processes, the scheduling policy can be managed by user-mode processes. This ensures that each component has "exactly the power to do its work and nothing more" \cite{tanenbaum2015modern}, fundamentally enhancing system stability and maintainability. +This architectural approach provides significant advantages in terms of fault tolerance. By running each device driver and server as a separate, relatively powerless user process, the kernel ensures that a bug in a single component, such as an audio driver, cannot crash the entire system. 
A key feature of this design is the ``reincarnation server'', which monitors system components and automatically replaces those that have failed, making the system effectively ``self-healing''~\cite{tanenbaum2015modern}. Furthermore, this modularity follows the principle of decoupling mechanism from policy. For example, while the kernel handles the mechanism of switching processes, the scheduling policy can be managed by user-mode processes. This ensures that each component has ``exactly the power to do its work and nothing more''~\cite{tanenbaum2015modern}, fundamentally enhancing system stability and maintainability.
\begin{figure}[htbp]
\centering
@@ -3203,44 +3163,44 @@ \section{Minix 3}
\end{tikzpicture}
}%
-\caption{The Minix 3 Microkernel Architecture (adapted from \cite{minix_wikipedia})}
+\caption{The Minix 3 Microkernel Architecture (adapted from~\cite{minix_wikipedia})}
\label{fig:minix_architecture}
\end{figure}
\subsection{Scheduler}
Adhering to the microkernel philosophy, Minix 3 separates scheduling policy from the low-level context switching mechanism. While the kernel remains responsible for the mechanics of context switching and interrupt handling, scheduling decisions are delegated to a dedicated user space server, referred to as the \texttt{Sched} server.
-The scheduler employs a multi-level priority round-robin algorithm \cite{minix_sched}. The system maintains sixteen priority queues organized hierarchically.
+The scheduler employs a multi-level priority round-robin algorithm~\cite{minix_sched}. The system maintains sixteen priority queues organized hierarchically.
Kernel tasks occupy the highest priority levels, followed by device drivers, system servers, and user applications. The lowest priority queue is reserved exclusively for idle tasks. At any scheduling decision point, the scheduler selects the next runnable process from the highest non-empty priority queue. Every process is assigned a specific time quantum, representing the maximum CPU time allowed before preemption occurs. Upon the exhaustion of a process's time quantum, the kernel notifies the scheduler server. To penalize CPU-bound tasks and maintain system responsiveness, the scheduler lowers the priority of such processes, moving them to a lower queue. -To prevent indefinite starvation of low-priority processes, a \texttt{balance\_queues} function executes periodically at five-second intervals \cite{minix_sched_bq}. This function re-evaluates process states and promotes tasks that have not received CPU time for an extended period, thereby ensuring that interactive applications remain responsive. +To prevent indefinite starvation of low-priority processes, a \texttt{balance\_queues} function executes periodically at five-second intervals~\cite{minix_sched_bq}. This function re-evaluates process states and promotes tasks that have not received CPU time for an extended period, thereby ensuring that interactive applications remain responsive. \subsection{Timing} The timing subsystem in Minix 3 is responsible for maintaining system time, coordinating scheduling intervals, and handling alarms and timers. The core timing functionality is encapsulated within the Clock Task, a kernel-level process that manages time-related operations. \subsubsection{Hardware Abstraction} -Minix 3 provides an abstraction layer over platform-specific timing hardware. On x86 systems, this includes the legacy Programmable Interval Timer (PIT) \cite{osdev-pit}, while ARM-based platforms rely on board-specific timer implementations \cite{minix_arch_clock}. 
During system initialization, the kernel configures the selected hardware timer to generate periodic interrupts. If not specified in the kernel environment, the default interrupt frequency is architecture-dependent: 60 Hz on x86 and 1000 Hz on ARM systems. +Minix 3 provides an abstraction layer over platform-specific timing hardware. On x86 systems, this includes the legacy Programmable Interval Timer (PIT)~\cite{osdev-pit}, while ARM-based platforms rely on board-specific timer implementations~\cite{minix_arch_clock}. During system initialization, the kernel configures the selected hardware timer to generate periodic interrupts. If not specified in the kernel environment, the default interrupt frequency is architecture-dependent: 60 Hz on x86 and 1000 Hz on ARM systems. \subsubsection{Timers and Alarms} -The kernel maintains a \texttt{clock\_timers} queue to manage synchronous timers and alarms \cite{minix_kernel_clock}. Upon receipt of a hardware timer interrupt, the \texttt{timer\_int\_handler} routine is executed. If the expiration time of the next scheduled timer in the \texttt{clock\_timers} queue has been reached, the kernel executes the associated callback function. +The kernel maintains a \texttt{clock\_timers} queue to manage synchronous timers and alarms~\cite{minix_kernel_clock}. Upon receipt of a hardware timer interrupt, the \texttt{timer\_int\_handler} routine is executed. If the expiration time of the next scheduled timer in the \texttt{clock\_timers} queue has been reached, the kernel executes the associated callback function. -In the case of system alarms, this callback is the \texttt{cause\_alarm} function \cite{minix_do_setalarm}. This function sends a notification message from the Clock Task to the requesting process (e.g., the Process Manager or Scheduler), informing it that the requested time interval has elapsed. This mechanism allows user space servers to perform periodic tasks, such as the scheduler's queue balancing. 
+In the case of system alarms, this callback is the \texttt{cause\_alarm} function~\cite{minix_do_setalarm}. This function sends a notification message from the Clock Task to the requesting process (e.g., the Process Manager or Scheduler), informing it that the requested time interval has elapsed. This mechanism allows user space servers to perform periodic tasks, such as the scheduler's queue balancing. \subsubsection{Real-Time Clock (RTC)} -Persistent wall-clock time is provided by a dedicated user space driver, \texttt{readclock}. This driver interfaces directly with the CMOS Real-Time Clock (RTC) to retrieve the current date and time. During system initialization, this information is communicated to the Process Manager to establish the system's initial time reference \cite{minix_readclock}. +Persistent wall-clock time is provided by a dedicated user space driver, \texttt{readclock}. This driver interfaces directly with the CMOS Real-Time Clock (RTC) to retrieve the current date and time. During system initialization, this information is communicated to the Process Manager to establish the system's initial time reference~\cite{minix_readclock}. \subsection{Memory Management} Memory management in Minix 3 is primarily handled by the user space Virtual Memory (VM) server. The kernel retains control over the hardware Memory Management Unit (MMU) and address space switching but delegates higher-level memory policies to the VM server. \subsubsection{VM Server} -The VM server is responsible for memory region allocation, page table administration, and page fault handling. It implements a region-based memory model in which a process address space is divided into contiguous regions with defined access permissions (read, write, and execute). To efficiently manage free and allocated memory regions, the VM server uses \textbf{AVL trees} \cite{minix_vm}. +The VM server is responsible for memory region allocation, page table administration, and page fault handling. 
It implements a region-based memory model in which a process address space is divided into contiguous regions with defined access permissions (read, write, and execute). To efficiently manage free and allocated memory regions, the VM server uses \textbf{AVL trees}~\cite{minix_vm}. \subsubsection{Allocation and Paging} -Internal memory allocation for kernel data structures and VM metadata is performed using a slab allocator \cite{slab_allocation}. Minix 3 supports demand paging: when a process attempts to access an unmapped virtual address, a page fault is raised. The kernel intercepts this exception and forwards the fault information to the VM server, which resolves it by mapping an appropriate physical frame or retrieving the required page from secondary storage \cite{shenoy_lecture}. +Internal memory allocation for kernel data structures and VM metadata is performed using a slab allocator~\cite{slab_allocation}. Minix 3 supports demand paging: when a process attempts to access an unmapped virtual address, a page fault is raised. The kernel intercepts this exception and forwards the fault information to the VM server, which resolves it by mapping an appropriate physical frame or retrieving the required page from secondary storage~\cite{shenoy_lecture}. \subsubsection{Memory Grants} -Due to the strict isolation of address spaces in the microkernel architecture, processes cannot directly access the memory of others. To facilitate the exchange of large data structures without violating protection boundaries, Minix 3 provides a capability-based mechanism known as \texttt{Memory Grants}, accessed via the \texttt{safecopy} API \cite{minix_safecopy}. +Due to the strict isolation of address spaces in the microkernel architecture, processes cannot directly access the memory of others. 
To facilitate the exchange of large data structures without violating protection boundaries, Minix 3 provides a capability-based mechanism known as \texttt{Memory Grants}, accessed via the \texttt{safecopy} API~\cite{minix_safecopy}. A process (the grantor) dynamically generates a grant capability that explicitly permits a specific peer process (the grantee) to read from or write to a designated memory range. The grantee utilizes this grant ID to request a data transfer from the System Task. The kernel validates the grant permissions before performing the copy operation between the disjoint address spaces. This mechanism ensures that drivers and servers can operate on client buffers while preventing unauthorized access to arbitrary memory locations. @@ -3265,10 +3225,10 @@ \subsubsection{Communication Policies} \end{enumerate} \subsubsection{System Call Implementation} -In Minix 3, system calls are not implemented as direct kernel invocations. Instead, standard library functions (e.g., \texttt{read}, \texttt{fork}) serve as wrappers that construct a message containing the call arguments and transmit it to the appropriate server using the \texttt{sendrec} primitive \cite{minix_ipc}. If the server needs to perform a privileged operation, it forwards the request to the System Task on behalf of the calling process, as servers possess higher privileges than user processes. +In Minix 3, system calls are not implemented as direct kernel invocations. Instead, standard library functions (e.g., \texttt{read}, \texttt{fork}) serve as wrappers that construct a message containing the call arguments and transmit it to the appropriate server using the \texttt{sendrec} primitive~\cite{minix_ipc}. If the server needs to perform a privileged operation, it forwards the request to the System Task on behalf of the calling process, as servers possess higher privileges than user processes. 
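To make the mechanism concrete, the following sketch models how such a library wrapper might pack its arguments into a fixed-size message and perform a synchronous \texttt{sendrec} round-trip. The structure layout, endpoint identifiers, and call numbers are illustrative simplifications, not the actual Minix definitions.

\begin{cppcode}[caption={Illustrative sketch of a message-based system call wrapper}, label={lst:sendrec_sketch}]
#include <cstdint>

// Simplified fixed-size IPC message, loosely modeled on Minix's message type.
struct Message {
    int32_t source;   // endpoint of the sender
    int32_t type;     // request: call number; reply: result value
    int32_t fd;       // call-specific payload fields
    int32_t nbytes;
    int64_t buffer;   // address of the caller's buffer
};

constexpr int32_t kVfsEndpoint = 1;  // hypothetical endpoint of the VFS server
constexpr int32_t kReadCall    = 4;  // hypothetical call number for read()

// Stand-in for the kernel's sendrec primitive: blocks until the target
// server replies, then overwrites the message with the reply. This stub
// "server" simply reports that all requested bytes were read.
int SendRec(int32_t /*endpoint*/, Message &msg) {
    msg.type = msg.nbytes;
    return 0;  // the sendrec operation itself succeeded
}

// Library wrapper: user code calls Read(); no privileged operation is
// performed directly - the request travels as a message to the VFS server.
int64_t Read(int fd, void *buf, int32_t nbytes) {
    Message msg{};
    msg.type   = kReadCall;
    msg.fd     = fd;
    msg.nbytes = nbytes;
    msg.buffer = reinterpret_cast<int64_t>(buf);
    if (SendRec(kVfsEndpoint, msg) != 0) return -1;
    return msg.type;  // the reply message carries the result
}
\end{cppcode}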
\subsubsection{Server Delegation} -System calls are dispatched to specialized servers according to their functionality \cite{minix_callnr}: +System calls are dispatched to specialized servers according to their functionality~\cite{minix_callnr}: \begin{itemize} \item \textbf{Process Manager (PM):} Responsible for process lifecycle operations such as \texttt{fork}, \texttt{exec}, \texttt{exit}, and \texttt{wait}. \item \textbf{Virtual File System (VFS):} Handles file-related operations including \texttt{open}, \texttt{read}, \texttt{write}, and \texttt{stat}. @@ -3283,13 +3243,13 @@ \chapter{Low-Level Kernel Implementation} \section{Bootloader} \label{sec:bootloader_implementation} -Before the architecture-agnostic kernel can commence execution, the underlying hardware must be brought to a known, deterministic state. As discussed in Section \ref{subsec:theory_bootloader}, the initialization protocols differ significantly depending on the specific hardware architecture and the firmware interface. +Before the architecture-agnostic kernel can commence execution, the underlying hardware must be brought to a known, deterministic state. As discussed in Section~\ref{subsec:theory_bootloader}, the initialization protocols differ significantly depending on the specific hardware architecture and the firmware interface. -For the AlkOS kernel on the x86-64 architecture, we rely on the \textbf{Multiboot2} specification \cite{Multiboot2-Spec}. This allows us to utilize a battle-tested external bootloader, such as GRUB, to handle the intricacies of storage drivers, file system parsing, and memory map retrieval. GRUB loads our kernel binary into memory and transfers control to our entry point. +For the AlkOS kernel on the x86-64 architecture, we rely on the \textbf{Multiboot2} specification~\cite{Multiboot2-Spec}. This allows us to utilize a battle-tested external bootloader, such as GRUB, to handle the intricacies of storage drivers, file system parsing, and memory map retrieval. 
GRUB loads our kernel binary into memory and transfers control to our entry point. However, the state provided by Multiboot2 is insufficient for the immediate execution of our C++ kernel. Upon entry, the system is in \textbf{32-bit Protected Mode}, paging is disabled, interrupts are disabled, and the stack pointer is undefined. Furthermore, the kernel is loaded at a low physical address, whereas AlkOS is designed as a \textbf{higher-half kernel}: its code and data are linked to virtual addresses in the upper portion of the 64-bit address space (\texttt{0xFFFFFFFE00000000}). -To bridge this gap, we implemented a two-stage internal bootloader: \textbf{Loader32} and \textbf{Loader64}. Each stage exists as a separate binary with distinct compilation requirements, as illustrated in Figure \ref{fig:boot_flow}. +To bridge this gap, we implemented a two-stage internal bootloader: \textbf{Loader32} and \textbf{Loader64}. Each stage exists as a separate binary with distinct compilation requirements, as illustrated in Figure~\ref{fig:boot_flow}. \begin{figure}[htbp] \centering @@ -3331,14 +3291,14 @@ \subsection{Why Two Internal Loaders?} \begin{enumerate} \item \textbf{Loader32 contains 32-bit instructions}: The transition from Protected Mode to Long Mode requires executing 32-bit code. Instructions such as setting control register bits, loading the GDT, and performing the mode switch \textit{cannot} be encoded in 64-bit instruction format. Thus, Loader32 is compiled as a 32-bit ELF binary (\texttt{elf32-i386}). - \item \textbf{Loader64 is Position-Independent Code (PIC)}: The kernel is linked at a fixed higher-half virtual address (\texttt{0xFFFFFFFE00000000}). However, Loader64 must execute \textit{before} virtual memory is fully configured. GRUB loads Loader64 elf as a Multiboot2 module. This elf must still be loaded in the first place in memory that is big enough to hold it, meaning at an \textit{unknown physical address}. 
To function correctly regardless of where it is loaded, Loader64 is compiled as a Position-Independent Executable (PIE) using \texttt{-fPIE -pie} flags, employing RIP-relative addressing for all memory accesses.
+    \item \textbf{Loader64 is Position-Independent Code (PIC)}: The kernel is linked at a fixed higher-half virtual address (\texttt{0xFFFFFFFE00000000}). However, Loader64 must execute \textit{before} virtual memory is fully configured. GRUB loads the Loader64 ELF as a Multiboot2 module. This ELF must itself be placed in a region of physical memory large enough to hold it, meaning at an \textit{unknown physical address}. To function correctly regardless of where it is loaded, Loader64 is compiled as a Position-Independent Executable (PIE) using \texttt{-fPIE -pie} flags, employing RIP-relative addressing for all memory accesses.
    \item \textbf{The kernel ignores its physical location}: Once virtual memory is established with the higher-half mapping, the kernel code executes using virtual addresses. It is unaware of, and unaffected by, its underlying physical memory location.
\end{enumerate}


\subsection{Loader32: 32-bit to 64-bit Transition}

-Loader32 is the true entry point of AlkOS, receiving control directly from GRUB. Its primary responsibility is to transition the CPU from 32-bit Protected Mode to 64-bit Long Mode, enabling the 4-level paging required by the x86-64 architecture \cite{Intel-ControlRegisters}.
+Loader32 is the true entry point of AlkOS, receiving control directly from GRUB. Its primary responsibility is to transition the CPU from 32-bit Protected Mode to 64-bit Long Mode, enabling the 4-level paging required by the x86-64 architecture~\cite{Intel-ControlRegisters}.

\subsubsection{Entry Point}

@@ -3365,7 +3325,7 @@ \subsubsection{Entry Point}

\subsubsection{CPU Feature Detection}

-Before attempting the Long Mode transition, Loader32 verifies that the CPU supports the required features. 

This is accomplished through the CPUID instruction \cite{Intel-CPUID}, as shown in Listing~\ref{lst:check_longmode}: +Before attempting the Long Mode transition, Loader32 verifies that the CPU supports the required features. This is accomplished through the CPUID instruction~\cite{Intel-CPUID}, as shown in Listing~\ref{lst:check_longmode}: \begin{nasmcode}[caption={Long Mode Detection (checks.nasm)}, label={lst:check_longmode}] EXTENDED_FUNCTIONS_THRESHOLD equ 0x80000000 @@ -3422,7 +3382,7 @@ \subsubsection{Memory Management Initialization} \subsubsection{Long Mode Enablement} -The transition to Long Mode follows the sequence defined in the Intel Software Developer's Manual \cite{Intel-ControlRegisters}, as shown in Listing~\ref{lst:enable_longmode}: +The transition to Long Mode follows the sequence defined in the Intel Software Developer's Manual~\cite{Intel-ControlRegisters}, as shown in Listing~\ref{lst:enable_longmode}: \begin{nasmcode}[caption={Enabling Long Mode and Paging (enables.nasm)}, label={lst:enable_longmode}] LONG_MODE_BIT equ 1 << 8 @@ -3732,7 +3692,7 @@ \subsubsection{Kernel Arguments Preparation} \subsection{Memory Layout} -Figure \ref{fig:memory_layout} illustrates the virtual memory layout established by the bootloader before kernel execution begins. +Figure~\ref{fig:memory_layout} illustrates the virtual memory layout established by the bootloader before kernel execution begins. \begin{figure}[htbp] \centering @@ -3868,7 +3828,7 @@ \subsubsection{The \texttt{PageMeta} Structure} To minimize memory overhead, the \texttt{PageMeta} structure utilizes a C++ \texttt{union} to multiplex its data fields. A physical page acts in mutually exclusive roles at any given time: it can be a free block in the Buddy System, a slab cache, or a generic allocated page, but never multiple simultaneously. By overlapping the storage requirements for these roles, the structure remains compact. 
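The space saving achieved by this overlap can be illustrated with a toy structure (hypothetical; the real fields of \texttt{PageMeta} are shown in Listing~\ref{lst:page_meta}): a \texttt{union} sizes the data section by the largest role rather than by the sum of all roles.

\begin{cppcode}[caption={Toy illustration of role multiplexing via a union}, label={lst:union_sketch}]
#include <cstdint>

// Toy illustration (not the actual PageMeta) of role multiplexing via a
// union: mutually exclusive roles share one storage slot, so the struct
// is sized by the largest role, not by the sum of all of them.
struct FreeBlock { uint32_t next_free_index; };              // Buddy System role
struct SlabInfo  { uint16_t object_size; uint16_t in_use; }; // slab cache role
struct Generic   { uint32_t ref_count; };                    // allocated page role

struct ToyMeta {
    uint8_t type;   // fixed header: which role is currently active
    uint8_t order;  // fixed header: e.g. buddy block order
    union {         // polymorphic data section: one role at a time
        FreeBlock free;
        SlabInfo  slab;
        Generic   page;
    } data;
};
\end{cppcode}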
-As shown in Listing \ref{lst:page_meta}, the structure contains a fixed header (type and order) and a polymorphic data section. +As shown in Listing~\ref{lst:page_meta}, the structure contains a fixed header (type and order) and a polymorphic data section. \begin{cppcode}[caption={Polymorphic Page Metadata Structure}, label={lst:page_meta}] struct PageMeta { @@ -4160,13 +4120,13 @@ \subsection{Summary of Memory Flow} \section{Virtual Memory Management} \label{sec:impl_vmm} -While the Physical Memory Manager accounts for raw RAM usage, the Virtual Memory Manager (VMM) allows the kernel to shape the execution environment. Our implementation aims to provide a robust, architecture-agnostic interface that supports advanced features like lazy allocation and efficient kernel-space management, while abstracting away the specifics of the x86-64 paging hardware. +While the Physical Memory Manager accounts for raw RAM usage, the Virtual Memory Manager (VMM) allows the kernel to shape the execution environment. Our implementation aims to provide a robust, architecture-agnostic interface that supports advanced features like lazy allocation and efficient kernel space management, while abstracting away the specifics of the x86-64 paging hardware. \subsection{Design Overview and Architecture} The cornerstone of the AlkOS memory layout is the \textbf{Higher-Half Kernel} architecture combined with a \textbf{Direct Physical Mapping}. -Upon initialization, the bootloader (Loader64) ensures that the kernel executable is linked and loaded at the top of the virtual address space, specifically starting at \texttt{0xFFFFFFFE00000000}. This design choice keeps the lower half of the address space (canonical addresses starting with \texttt{0x0000...}) strictly reserved for user-space applications, enforcing a clean separation between kernel and user contexts. 
+Upon initialization, the bootloader (Loader64) ensures that the kernel executable is linked and loaded at the top of the virtual address space, specifically starting at \texttt{0xFFFFFFFE00000000}. This design choice keeps the lower half of the address space (canonical addresses starting with \texttt{0x0000...}) strictly reserved for user space applications, enforcing a clean separation between kernel and user contexts.

\subsubsection{The Direct Physical Map (HHDM)}

@@ -4234,7 +4194,7 @@ \subsection{The Address Space Abstraction}

\subsubsection{User Space Initialization}

-When creating a new user address space, the system must ensure that the process can still access essential kernel services (such as interrupt handlers and syscall entry points). The \texttt{AddressSpace::InitUser} method allocates a new, empty PML4 table and immediately copies the kernel-space mappings from the global kernel address space. This creates a "split" view where the lower half is empty (ready for user program code), and the upper half mirrors the kernel.
+When creating a new user address space, the system must ensure that the process can still access essential kernel services (such as interrupt handlers and syscall entry points). The \texttt{AddressSpace::InitUser} method allocates a new, empty PML4 table and immediately copies the kernel space mappings from the global kernel address space. This creates a ``split'' view where the lower half is empty (ready for user program code), and the upper half mirrors the kernel.

\subsection{Virtual Memory Areas (VMAs)}

@@ -4251,7 +4211,7 @@ \subsubsection{Direct Mapping}

The \texttt{DirectMappingVMemArea} is used when a specific range of physical memory must be exposed to virtual memory. This is essential for:
\begin{itemize}
    \item \textbf{Memory Mapped I/O (MMIO):} Mapping hardware device registers.
-    \item \textbf{Framebuffers:} Exposing the video memory to user-space applications (as seen in the Window Manager implementation). 
+ \item \textbf{Framebuffers:} Exposing the video memory to user space applications (as seen in the Window Manager implementation). \end{itemize} Unlike anonymous memory, this area ignores the "Not Present" fault mechanism for allocation; it simply maps the pre-existing physical range. @@ -4288,12 +4248,12 @@ \subsubsection{Kernel Synchronization Area} \subsection{Initialization and Handover} -The transition from the bootloader to the kernel's internal memory management is a critical phase. When the kernel's main function receives control, the system is running on page tables created by \texttt{Loader64}. These tables contain a temporary identity mapping (0-10 GB) that is necessary for the loader but undesirable for the kernel, as it exposes physical memory to user-space (NULL pointer dereferences would be valid accesses to physical address 0). +The transition from the bootloader to the kernel's internal memory management is a critical phase. When the kernel's main function receives control, the system is running on page tables created by \texttt{Loader64}. These tables contain a temporary identity mapping (0-10 GB) that is necessary for the loader but undesirable for the kernel, as it exposes physical memory to user space (NULL pointer dereferences would be valid accesses to physical address 0). To sanitize this state, we implemented the \texttt{Mem::Boot::BootMmuCleaner}. Its responsibilities are twofold: \begin{enumerate} - \item \textbf{Cleaning Identity Mappings:} It iterates through the lower half of the PML4 table and unmaps all entries. This ensures that the user-space range is completely empty before the first process is spawned. + \item \textbf{Cleaning Identity Mappings:} It iterates through the lower half of the PML4 table and unmaps all entries. This ensures that the user space range is completely empty before the first process is spawned. 
\item \textbf{Metadata Reconstruction:} The physical frames used by the bootloader to store the page tables themselves must be accounted for. The \texttt{BootMmuCleaner} walks the existing page table hierarchy. For every table frame encountered, it updates the global \texttt{PageMetaTable} to mark that frame's type as \texttt{PageTable} and initializes its reference count. This prevents the Physical Memory Manager from accidentally treating these critical frames as free memory. \end{enumerate} @@ -4385,7 +4345,7 @@ \section{Interrupts} \subsection{x86-64 Interrupts} \subsubsection{Interrupt Handling} -On the x86-64 architecture, every interrupt is assigned a unique number (vector) which maps directly to an entry in the \textbf{Interrupt Descriptor Table} (IDT) \cite{IntelManual-Interrupts}. This table contains instructions for the CPU on how to react to specific interrupts. The entry layout is as follows: +On the x86-64 architecture, every interrupt is assigned a unique number (vector) which maps directly to an entry in the \textbf{Interrupt Descriptor Table} (IDT)~\cite{IntelManual-Interrupts}. This table contains instructions for the CPU on how to react to specific interrupts. The entry layout is as follows: \begin{cppcode}[caption={IDT Entry Layout}, label={lst:idtEntry}] enum class IdtPrivilegeLevel : u8 { kRing0 = 0, kRing1 = 1, kRing2 = 2, kRing3 = 3 }; @@ -4408,11 +4368,11 @@ \subsubsection{Interrupt Handling} }; \end{cppcode} -The most critical component of the IDT entry is the address of the function to be invoked (split into \texttt{isr\_low}, \texttt{isr\_mid}, and \texttt{isr\_high}). Another important field is \texttt{kernel\_cs}, which specifies the code segment \cite{IntelManual-Segments} loaded before executing the handler. In x86-64, there are four privilege levels (Rings). We assume the kernel always operates in Ring 0 (the most privileged level). Therefore, the Ring 0 kernel code segment is always written to this field. 
Conversely, the \texttt{IdtPrivilegeLevel dpl} field specifies the minimal privilege level required to trigger the interrupt via software, which is essential for implementing system calls (syscalls) invoked from user space. +The most critical component of the IDT entry is the address of the function to be invoked (split into \texttt{isr\_low}, \texttt{isr\_mid}, and \texttt{isr\_high}). Another important field is \texttt{kernel\_cs}, which specifies the code segment~\cite{IntelManual-Segments} loaded before executing the handler. In x86-64, there are four privilege levels (Rings). We assume the kernel always operates in Ring 0 (the most privileged level). Therefore, the Ring 0 kernel code segment is always written to this field. Conversely, the \texttt{IdtPrivilegeLevel dpl} field specifies the minimal privilege level required to trigger the interrupt via software, which is essential for implementing system calls (syscalls) invoked from user space. \subsubsection{Interrupt Service Routines} -Functions handling interrupts differ significantly from standard functions generated by the compiler. When the CPU calls an interrupt handler, it first switches the stack pointer to the kernel stack if a privilege level change occurs (e.g., Ring 3 to Ring 0). This transition is managed via the Task State Segment (TSS) mechanism (for details, refer to \cite{osdev-tss}). The CPU then changes the code segment as specified in the \textbf{IDT entry} (Listing \ref{lst:idtEntry}). Subsequently, it pushes the state of the interrupted procedure onto the stack. +Functions handling interrupts differ significantly from standard functions generated by the compiler. When the CPU calls an interrupt handler, it first switches the stack pointer to the kernel stack if a privilege level change occurs (e.g., Ring 3 to Ring 0). This transition is managed via the Task State Segment (TSS) mechanism (for details, refer to~\cite{osdev-tss}). 
The CPU then changes the code segment as specified in the \textbf{IDT entry} (Listing~\ref{lst:idtEntry}). Subsequently, it pushes the state of the interrupted procedure onto the stack. \begin{figure}[htbp] \centering @@ -4452,7 +4412,7 @@ \subsubsection{Interrupt Service Routines} \label{fig:stackframe} \end{figure} -The layout illustrated in Figure \ref{fig:stackframe} is known as the \textbf{Interrupt Frame}. This structure is fundamental to the kernel architecture, serving as the basis for context switching, context conversion, and jumping to user space (Ring 3). A key distinction is that an \textbf{ISR} must return using the special instruction \textbf{IRETQ} \cite{IntelManual-Interrupts}, which reverses the actions described above, including restoring the privilege level and code segment. +The layout illustrated in Figure~\ref{fig:stackframe} is known as the \textbf{Interrupt Frame}. This structure is fundamental to the kernel architecture, serving as the basis for context switching, context conversion, and jumping to user space (Ring 3). A key distinction is that an \textbf{ISR} must return using the special instruction \textbf{IRETQ}~\cite{IntelManual-Interrupts}, which reverses the actions described above, including restoring the privilege level and code segment. A significant challenge with standard compiler-generated functions is that they may modify the stack frame (prologue/epilogue) in ways that interfere with the hardware-defined layout. To maintain full control and prevent stack corruption, we implement assembly wrappers. These wrappers perform the necessary architecture-specific actions before invoking the architecture-agnostic interrupt handling code defined in C++. @@ -4501,14 +4461,14 @@ \subsubsection{Interrupt Service Routines} \subsubsection{Synchronization} -Since an interrupt can potentially trigger a context switch, synchronization is crucial. 
This applies to both kernel-space threads at any moment of their lifetime and user space programs executing system calls. It would be catastrophic if a timer interrupt forced a context switch while the kernel was in the middle of updating scheduler structures or memory tables. Therefore, even on a single-core system, synchronization must be enforced. This is achieved by disabling hardware interrupts (using \textbf{CLI} and \textbf{STI} instructions on x86-64) during critical sections. +Since an interrupt can potentially trigger a context switch, synchronization is crucial. This applies to both kernel space threads at any moment of their lifetime and user space programs executing system calls. It would be catastrophic if a timer interrupt forced a context switch while the kernel was in the middle of updating scheduler structures or memory tables. Therefore, even on a single-core system, synchronization must be enforced. This is achieved by disabling hardware interrupts (using \textbf{CLI} and \textbf{STI} instructions on x86-64) during critical sections. \subsection{Unified Interrupt Frame} \label{subsec:unifiedFrame} -To simplify interrupt handling, we introduced a unified frame structure based on the hardware Interrupt Frame (Figure \ref{fig:stackframe}). Since the state of a thread must be preserved before a context switch, all general-purpose registers must be saved. To achieve this efficiently, these registers are pushed onto the stack by the assembly wrapper, creating the \textbf{Unified Interrupt Frame}. +To simplify interrupt handling, we introduced a unified frame structure based on the hardware Interrupt Frame (Figure~\ref{fig:stackframe}). Since the state of a thread must be preserved before a context switch, all general-purpose registers must be saved. To achieve this efficiently, these registers are pushed onto the stack by the assembly wrapper, creating the \textbf{Unified Interrupt Frame}. 
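The critical-section discipline described above is typically wrapped in an RAII guard. The following host-testable sketch models interrupt masking with a plain flag; the names are hypothetical, and a real kernel implementation would execute \texttt{cli}/\texttt{sti} (or save and restore \texttt{RFLAGS}) via inline assembly instead.

\begin{cppcode}[caption={Hypothetical RAII interrupt guard (host-testable model)}, label={lst:irq_guard_sketch}]
// The global flag stands in for RFLAGS.IF so the behavior can be
// exercised on a host; it is NOT how the real hardware state is read.
inline bool g_interrupts_enabled = true;

class InterruptGuard {
public:
    InterruptGuard() : were_enabled_(g_interrupts_enabled) {
        g_interrupts_enabled = false;  // real kernel: asm volatile("cli")
    }
    ~InterruptGuard() {
        // Restore the previous state instead of unconditionally re-enabling,
        // so nested critical sections compose correctly.
        if (were_enabled_) g_interrupts_enabled = true;  // real kernel: "sti"
    }
    InterruptGuard(const InterruptGuard &) = delete;
    InterruptGuard &operator=(const InterruptGuard &) = delete;

private:
    bool were_enabled_;  // interrupt state observed on entry
};
\end{cppcode}

Restoring the saved state, rather than blindly re-enabling, is what allows scheduler and memory-table updates to nest such guards safely.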
-The x86-64 architecture introduces an inconsistency in the stack layout depending on the interrupt source. Certain exceptions, such as Page Faults (vector 14) or General Protection Faults (vector 13), automatically push an error code onto the stack by the CPU. Hardware interrupts and other exceptions do not. To use a single, unified C++ structure for all interrupt handling (\texttt{IsrErrorStackFrame}), the assembly entry wrappers must normalize the stack. For vectors that do not produce a hardware error code, the wrapper explicitly pushes a dummy value (typically 0) before saving the general-purpose registers (as seen in Listing \ref{lst:isr_asm}). This ensures that the stack pointer is always aligned correctly and points to a uniform structure when the C++ handler is invoked. +The x86-64 architecture introduces an inconsistency in the stack layout depending on the interrupt source. Certain exceptions, such as Page Faults (vector 14) or General Protection Faults (vector 13), automatically push an error code onto the stack by the CPU. Hardware interrupts and other exceptions do not. To use a single, unified C++ structure for all interrupt handling (\texttt{IsrErrorStackFrame}), the assembly entry wrappers must normalize the stack. For vectors that do not produce a hardware error code, the wrapper explicitly pushes a dummy value (typically 0) before saving the general-purpose registers (as seen in Listing~\ref{lst:isr_asm}). This ensures that the stack pointer is always aligned correctly and points to a uniform structure when the C++ handler is invoked. \begin{figure}[htbp] \centering @@ -4561,7 +4521,7 @@ \subsection{Context Switch} \item Jumps back to the code address pointed to by \textbf{RIP}. \end{itemize} -To switch contexts, we must ensure the current thread's state is preserved in a \textbf{Unified Interrupt Frame} (Figure \ref{fig:unifiedstackframe}) on its kernel stack. Additionally, a valid frame must exist for the target thread. 
If a thread has run previously, it will have saved its frame naturally during its last preemption. However, for a new thread that has never executed, this frame must be constructed manually. +To switch contexts, we must ensure the current thread's state is preserved in a \textbf{Unified Interrupt Frame} (Figure~\ref{fig:unifiedstackframe}) on its kernel stack. Additionally, a valid frame must exist for the target thread. If a thread has run previously, it will have saved its frame naturally during its last preemption. However, for a new thread that has never executed, this frame must be constructed manually. We assume that all threads begin execution in kernel space before eventually jumping to user space. Since all interrupt handling occurs within the kernel, the interrupt frame always resides on the kernel stack. Therefore, we can initialize the kernel stack of the new thread with a fabricated frame before the context switch. The initialization procedure is shown in Listing~\ref{lst:threadStack}: @@ -4600,7 +4560,7 @@ \subsection{Context Switch} } \end{cppcode} -With both stack frames prepared, the context switch logic is straightforward: swap the \textbf{RSP} (current stack pointer) to the kernel stack of the target thread and execute the \textbf{IRETQ} instruction. This instruction restores the state and resumes execution, as demonstrated in the \textbf{context\_switch\_if\_needed} macro (Listing \ref{lst:isr_asm}). +With both stack frames prepared, the context switch logic is straightforward: swap the \textbf{RSP} (current stack pointer) to the kernel stack of the target thread and execute the \textbf{IRETQ} instruction. This instruction restores the state and resumes execution, as demonstrated in the \textbf{context\_switch\_if\_needed} macro (Listing~\ref{lst:isr_asm}). 
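At the C++ level, the hand-off described above reduces to an exchange of saved kernel-stack pointers; the actual transfer of control happens in the assembly stub that loads \textbf{RSP} and executes \textbf{IRETQ}. The following fragment is an illustrative model with hypothetical names, not the AlkOS scheduler code.

\begin{cppcode}[caption={Illustrative model of the stack-pointer hand-off}, label={lst:switch_sketch}]
#include <cstdint>

struct Thread {
    uint64_t saved_rsp;  // kernel-stack address of the saved Unified Interrupt Frame
};

// Record where the outgoing thread's frame lives and return the stack
// pointer the assembly stub should load into RSP before executing IRETQ.
uint64_t PrepareSwitch(Thread &current, const Thread &next, uint64_t current_rsp) {
    current.saved_rsp = current_rsp;  // preserve the outgoing frame location
    return next.saved_rsp;            // assembly loads this, then IRETQ resumes
}
\end{cppcode}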
\subsection{Jumping to User Space} @@ -4713,7 +4673,7 @@ \subsection{Clocks} For the x86-64 architecture, clock selection is prioritized based on availability and precision in the following order: TSC > HPET > RTC > PIT. Currently, support is implemented only for the TSC and HPET drivers. \subsubsection{TSC} -The Time Stamp Counter (TSC) is the optimal clock source due to its high precision and core-locality. Accessing the TSC involves reading a CPU register, which requires only a few cycles, unlike external clocks that may require hundreds. However, the TSC has historical limitations. On older processors, the counter's frequency was tied to the core frequency. Consequently, frequency scaling (throttling) caused the time measurement to drift, rendering it unreliable. Modern Intel processors introduced the Invariant TSC, which ensures a constant frequency regardless of the core's power state. Another limitation is that on certain CPUs, the TSC frequency is not explicitly known and must be measured against a known reference. For this purpose, the system utilizes the HPET, which serves as the minimal hardware requirement for reliable calibration (see \cite{IntelManual-TSC}). +The Time Stamp Counter (TSC) is the optimal clock source due to its high precision and core-locality. Accessing the TSC involves reading a CPU register, which requires only a few cycles, unlike external clocks that may require hundreds. However, the TSC has historical limitations. On older processors, the counter's frequency was tied to the core frequency. Consequently, frequency scaling (throttling) caused the time measurement to drift, rendering it unreliable. Modern Intel processors introduced the Invariant TSC, which ensures a constant frequency regardless of the core's power state. Another limitation is that on certain CPUs, the TSC frequency is not explicitly known and must be measured against a known reference. 
+For this purpose, the system utilizes the HPET, which serves as the minimal hardware requirement for reliable calibration (see~\cite{IntelManual-TSC}).
 
 \subsubsection{HPET}
 The High Precision Event Timer (HPET) also offers high precision and is generally more stable than the TSC on older hardware. However, it is an external device mapped via Memory-Mapped I/O (MMIO). As a result, accessing the HPET is significantly slower than reading the TSC, potentially taking up to 1000 cycles per read. Due to this performance overhead, the HPET is designated as the secondary choice. It is primarily utilized for calibrating other clocks rather than for frequent timekeeping operations.
@@ -4778,12 +4738,12 @@ \subsubsection{LAPIC Timer}
 
 \chapter{High-Level Subsystems}
 \label{chap:high_level_subsystems}
 
-This chapter describes the high-level subsystems of AlkOS that build upon the low-level kernel infrastructure presented in the chapter \ref{chap:low_level_implementation}. These components are the file system and the scheduler.
+This chapter describes the high-level subsystems of AlkOS that build upon the low-level kernel infrastructure presented in Chapter~\ref{chap:low_level_implementation}. These components are the file system and the scheduler.
 
 \section{File System}
 \label{section:fs}
 
-The file system is a fundamental part of any operating sContextSwitchystem, providing mechanisms to store, access, and manage data on storage devices. We have designed a unified abstraction layer known as the Virtual File System (VFS). The VFS abstracts the underlying implementation details of specific file systems, providing the kernel and user space with an API for operations such as reading, writing, creating, moving, and deleting files and directories.
+The file system is a fundamental part of any operating system, providing mechanisms to store, access, and manage data on storage devices. We have designed a unified abstraction layer known as the Virtual File System (VFS).
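The calibration scheme described above — counting TSC ticks across an interval whose length is known from the HPET — reduces to simple integer arithmetic. The sketch below is illustrative (the function name and parameters are not AlkOS identifiers); a real calibration routine must additionally guard against counter wraparound and interrupt jitter during the measurement window.

```cpp
#include <cstdint>

// Given two readings of an unknown-frequency counter (e.g. the TSC) taken at
// the endpoints of an interval of known length (e.g. timed by the HPET),
// derive the counter's frequency in Hz:
//   ticks / (interval_ns / 1e9)  ==  ticks * 1e9 / interval_ns
constexpr uint64_t CalibrateFrequencyHz(uint64_t counter_start,
                                        uint64_t counter_end,
                                        uint64_t interval_ns) {
    const uint64_t ticks = counter_end - counter_start;
    return ticks * 1'000'000'000ULL / interval_ns;
}
```

For example, observing 3,000,000 TSC ticks over a 1 ms HPET-timed interval yields a 3 GHz TSC. Once the frequency is known, the fast `RDTSC` path can serve all subsequent timekeeping without touching the slow MMIO-mapped HPET.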
+The VFS abstracts the underlying implementation details of specific file systems, providing the kernel and user space with an API for operations such as reading, writing, creating, moving, and deleting files and directories.
 
 \subsection{VFS}
@@ -4837,7 +4797,7 @@ \subsection{VFS}
 \label{fig:vfs_architecture}
 \end{figure}
 
-As illustrated in Figure \ref{fig:vfs_architecture}, the VFS architecture is composed of four primary layers:
+As illustrated in Figure~\ref{fig:vfs_architecture}, the VFS architecture is composed of four primary layers:
 
 \begin{itemize}
     \item \textbf{VFS Module}: The central orchestrator for path resolution and operation delegation.
@@ -4854,7 +4814,7 @@ \subsubsection{VFS Module}
 
 When a VFS operation (e.g., \texttt{CreateFile}) is invoked, the module executes the following sequence:
 \begin{enumerate}
-    \item \textbf{Find Mount Point}: The system utilizes a \textbf{crit-bit tree} \cite{critbit} to efficiently locate the longest prefix match for the given path among all registered mount points. This step identifies the specific file system driver responsible for handling the operation.
+    \item \textbf{Find Mount Point}: The system utilizes a \textbf{crit-bit tree}~\cite{critbit} to efficiently locate the longest prefix match for the given path among all registered mount points. This step identifies the specific file system driver responsible for handling the operation.
     \item \textbf{Check Permissions}: The module validates the operation against the mount point options (e.g., ensuring write operations are not attempted on read-only mounts).
     \item \textbf{Path Translation}: The absolute system path is translated into a path relative to the mount point root.
     \item \textbf{Delegation}: The operation is delegated to the corresponding method of the specific file system driver.
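Steps 1 and 3 of the sequence above — longest-prefix mount-point lookup followed by path translation — can be sketched as below. Note the hedges: AlkOS performs the lookup with a crit-bit tree for efficiency, whereas this illustration uses a linear scan to expose the same longest-prefix-match semantics; all type and function names are invented for the example.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Illustrative mount-table entry; stands in for whatever the VFS registers.
struct MountPoint {
    std::string path;  // e.g. "/" or "/mnt/fat32"
    bool read_only;
};

struct Resolved {
    const MountPoint *mount;  // driver-owning mount point
    std::string relative;     // path relative to the mount root (step 3)
};

// Step 1: pick the registered mount whose path is the longest prefix of the
// request, matching only on whole path components. Step 3: strip that prefix.
std::optional<Resolved> ResolveMount(const std::vector<MountPoint> &table,
                                     std::string_view path) {
    const MountPoint *best = nullptr;
    for (const auto &mp : table) {
        if (path.substr(0, mp.path.size()) != mp.path) continue;
        // Reject "/mnt/fat" matching "/mnt/fat32": require a '/' boundary.
        if (path.size() > mp.path.size() && mp.path != "/" &&
            path[mp.path.size()] != '/')
            continue;
        if (!best || mp.path.size() > best->path.size()) best = &mp;
    }
    if (!best) return std::nullopt;
    std::string rel(path.substr(best->path == "/" ? 0 : best->path.size()));
    if (rel.empty()) rel = "/";
    return Resolved{best, rel};
}
```

A crit-bit tree gives the same answer without scanning every mount: walking toward the query key visits exactly the stored keys that can be its longest prefix, which is why it suits a mount table that may grow large.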
@@ -4915,9 +4875,9 @@ \subsubsection{File System Interface}
 
 \subsubsection{File System Driver}
 
-Each file system driver implements the logic required for a specific file system type (e.g., FAT12, FAT16, FAT32). The driver interprets the raw data structures on the storage device and performs the requested operations. For instance, the FAT32 driver handles the manipulation of the File Allocation Table, directory entries, and cluster chains according to the FAT32 specification (for details, see \cite{fat-spec}).
+Each file system driver implements the logic required for a specific file system type (e.g., FAT12, FAT16, FAT32). The driver interprets the raw data structures on the storage device and performs the requested operations. For instance, the FAT32 driver handles the manipulation of the File Allocation Table, directory entries, and cluster chains according to the FAT32 specification (for details, see~\cite{fat-spec}).
 
-To implement these drivers efficiently, we employ the Curiously Recurring Template Pattern (CRTP) \cite{crtp}. Each driver class inherits from a templated base class that provides common functionality, while the derived class implements format-specific details (Listing~\ref{lst:fatCrtp}). This approach enables static polymorphism, reducing the runtime overhead typically associated with virtual function calls.
+To implement these drivers efficiently, we employ the Curiously Recurring Template Pattern (CRTP)~\cite{crtp}. Each driver class inherits from a templated base class that provides common functionality, while the derived class implements format-specific details (Listing~\ref{lst:fatCrtp}). This approach enables static polymorphism, reducing the runtime overhead typically associated with virtual function calls.
 
 \begin{cppcode}[caption={FAT Driver CRTP Base Class}, label={lst:fatCrtp}]
 template