Documentation/page_tables: Add info about MMU/TLB and Page Faults

Extend page_tables.rst by adding a section about the role of the MMU and
TLB in translating between virtual addresses and physical page frames.
Furthermore, explain the concept behind Page Faults and how the Linux
kernel handles TLB misses. Finally, briefly explain how and why to disable
the page fault handler.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
Acked-by: "Mike Rapoport (IBM)" <rppt@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230818112726.6156-1-fmdefrancesco@gmail.com

@@ -152,3 +152,130 @@ Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all of the
currently five levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.

MMU, TLB, and Page Faults
=========================

The `Memory Management Unit (MMU)` is a hardware component that handles virtual
to physical address translations. It may use relatively small caches in hardware
called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
these translations.

When the CPU accesses a memory location, it provides a virtual address to the
MMU, which checks whether an existing translation is present in the TLB or in
the Page Walk Caches (on architectures that support them). If no translation is
found, the MMU walks the page tables to determine the physical address and
create the mapping.
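
Conceptually, the lookup performed on every memory access follows the flow
below; the function names are purely illustrative, since the MMU performs
these steps in hardware, transparently to the kernel::

    if (tlb_lookup(vaddr, &paddr)) {
            access(paddr);                  /* TLB hit: fast path */
    } else if (walk_page_tables(vaddr, &paddr)) {
            tlb_insert(vaddr, paddr);       /* cache the new translation */
            access(paddr);
    } else {
            raise_page_fault(vaddr);        /* no valid translation */
    }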

Each page of memory has associated permission and dirty bits. The dirty bit is
set (i.e., turned on) when the page is written to, and indicates that the page
has been modified since it was loaded into memory.

If nothing prevents it, the physical memory is eventually accessed and the
requested operation on the physical frame is performed.

There are several reasons why the MMU can't find certain translations. It could
happen because the CPU is trying to access memory that the current task is not
permitted to access, or because the data is not present in physical memory.

When these conditions happen, the MMU triggers page faults, which are types of
exceptions that signal the CPU to pause the current execution and run a special
function to handle them.

There are common and expected causes of page faults. These are triggered by
process management optimization techniques called "Lazy Allocation" and
"Copy-on-Write". Page faults may also happen when frames have been swapped out
to persistent storage (swap partition or file) and evicted from their physical
locations.

These techniques improve memory efficiency, reduce latency, and minimize space
occupation. This document won't go deeper into the details of "Lazy Allocation"
and "Copy-on-Write" because these subjects are out of scope; they belong to
Process Address Management.
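
Even without going into those designs, the effect of lazy allocation is easy to
observe from user space. The following sketch is a small demonstration program
(not kernel code, and the exact counts depend on the C library in use): it maps
anonymous memory and shows that minor page faults are accounted only when the
pages are first touched::

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    static long minor_faults(void)
    {
            struct rusage ru;

            getrusage(RUSAGE_SELF, &ru);
            return ru.ru_minflt;
    }

    int main(void)
    {
            size_t len = 64 * 4096;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            long before, after;

            if (p == MAP_FAILED)
                    return 1;

            before = minor_faults();
            memset(p, 0xff, len);   /* first touch faults the pages in */
            after = minor_faults();

            printf("minor faults while touching: %ld\n", after - before);
            munmap(p, len);
            return 0;
    }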

Swapping differs from the other techniques mentioned above in that it is not
desirable, since it is performed only as a means to free physical memory under
heavy pressure.

Swapping can't work for memory mapped by kernel logical addresses. These are a
subset of the kernel virtual space that directly maps a contiguous range of
physical memory. Given any logical address, its physical address is determined
with simple arithmetic on an offset. Accesses to logical addresses are fast
because they avoid the need for complex page table lookups, at the expense of
the frames not being evictable or pageable out.
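
A minimal sketch of the arithmetic involved is shown below. It assumes the
common direct-map layout in which `PAGE_OFFSET` marks the start of the linear
mapping, and it glosses over the extra checks and per-architecture offsets that
the real `__pa()`/`__va()`/`virt_to_phys()` helpers apply::

    /* Illustration only: use the real __pa()/__va() helpers in kernel code. */
    static inline unsigned long example_virt_to_phys(unsigned long vaddr)
    {
            return vaddr - PAGE_OFFSET;     /* plain arithmetic, no table walk */
    }

    static inline unsigned long example_phys_to_virt(unsigned long paddr)
    {
            return paddr + PAGE_OFFSET;
    }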

If the kernel fails to make room for the data that must be present in the
physical frames, it invokes the out-of-memory (OOM) killer, which terminates
lower priority processes until the pressure falls below a safe threshold.

Additionally, page faults may also be caused by code bugs or by maliciously
crafted addresses that the CPU is instructed to access. A thread of a process
could use instructions to address (non-shared) memory which does not belong to
its own address space, or could try to execute an instruction that wants to
write to a read-only location.

If the above-mentioned conditions happen in user space, the kernel sends a
`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
causes the termination of the thread and of the process it belongs to.
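
The second case is easy to reproduce from user space. The following sketch
(again a demonstration program, not kernel code) maps a read-only page and then
attempts to write to it; the resulting page fault makes the kernel deliver
SIGSEGV to the faulting thread::

    #include <signal.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void on_segv(int sig)
    {
            (void)sig;
            /* Only async-signal-safe calls are allowed here. */
            write(2, "caught SIGSEGV\n", 15);
            _exit(1);
    }

    int main(void)
    {
            char *p = mmap(NULL, 4096, PROT_READ,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;

            signal(SIGSEGV, on_segv);
            p[0] = 'x';             /* write to a read-only page: faults */
            return 0;               /* never reached */
    }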

This document is going to simplify and show a high altitude view of how the
Linux kernel handles these page faults, creates tables and table entries,
checks whether memory is present and, if not, requests to load data from
persistent storage or from other devices, and updates the MMU and its caches.

The first steps are architecture dependent. Most architectures jump to
`do_page_fault()`, whereas the x86 interrupt handler is defined by the
`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro, which calls `handle_page_fault()`.

Whatever the route, all architectures end up invoking `handle_mm_fault()`
which, in turn, (likely) ends up calling `__handle_mm_fault()` to carry out the
actual work of allocating the page tables.

The unfortunate case of not being able to call `__handle_mm_fault()` means
that the virtual address is pointing to areas of physical memory which are not
permitted to be accessed (at least from the current context). This condition
results in the kernel sending the above-mentioned SIGSEGV signal to the
process, with the consequences already explained.

`__handle_mm_fault()` carries out its work by calling several functions to
find the entry offsets of the upper layers of the page tables and allocate the
tables that it may need.

The functions that look for the offsets have names like `*_offset()`, where the
"*" is for pgd, p4d, pud, pmd, pte; the functions that allocate the
corresponding tables, layer by layer, are called `*_alloc()`, using the
above-mentioned convention to name them after the corresponding types of tables
in the hierarchy.
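
The lookup side of this naming scheme is easiest to see in a simple
descend-the-hierarchy pattern such as the sketch below. It is a read-only walk
that just gives up when a level is missing, whereas the fault handling path
would call the corresponding `*_alloc()` helper instead; locking and the huge
page cases discussed below are ignored here::

    #include <linux/mm.h>

    /* Illustrative sketch, not an existing kernel helper. */
    static pte_t *walk_to_pte(struct mm_struct *mm, unsigned long addr)
    {
            pgd_t *pgd = pgd_offset(mm, addr);
            p4d_t *p4d;
            pud_t *pud;
            pmd_t *pmd;

            if (pgd_none(*pgd) || pgd_bad(*pgd))
                    return NULL;
            p4d = p4d_offset(pgd, addr);
            if (p4d_none(*p4d) || p4d_bad(*p4d))
                    return NULL;
            pud = pud_offset(p4d, addr);
            if (pud_none(*pud) || pud_bad(*pud))
                    return NULL;
            pmd = pmd_offset(pud, addr);
            if (pmd_none(*pmd) || pmd_bad(*pmd))
                    return NULL;
            return pte_offset_map(pmd, addr);   /* caller must pte_unmap() */
    }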

The page table walk may end at one of the middle or upper layers (PMD, PUD).
Linux supports larger page sizes than the usual 4KB (i.e., the so-called
`huge pages`). When using these kinds of larger pages, higher level entries can
map them directly, with no need to use lower level page entries (PTE). Huge
pages contain large contiguous physical regions that usually span from 2MB to
1GB. They are respectively mapped by the PMD and PUD page entries.

Huge pages bring several benefits like reduced TLB pressure, reduced page table
overhead, memory allocation efficiency, and performance improvement for certain
workloads. However, these benefits come with trade-offs, like wasted memory and
allocation challenges.
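
From user space, one way to request a PMD-sized mapping is `mmap()` with
`MAP_HUGETLB`, as in the demonstration sketch below. It assumes that huge pages
have been reserved beforehand (see
Documentation/admin-guide/mm/hugetlbpage.rst); otherwise the call simply
fails::

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 2 * 1024 * 1024;   /* one 2MB huge page */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap(MAP_HUGETLB)");    /* no huge pages reserved? */
                    return 1;
            }

            p[0] = 1;       /* this fault is resolved at the PMD level */
            munmap(p, len);
            return 0;
    }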

At the very end of the walk with allocations, provided it didn't return errors,
`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()`
performs one of `do_read_fault()`, `do_cow_fault()`, or `do_shared_fault()`.
The "read", "cow", and "shared" parts of those names hint at the reasons and
the kinds of fault being handled.
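
A heavily simplified view of that dispatch, modelled on `do_fault()` in
mm/memory.c (the real function handles several more corner cases), could look
like this: a read access takes the first branch, a write to a private mapping
takes the copy-on-write branch, and a write to a shared mapping takes the last
one::

    /* Sketch only; see do_fault() in mm/memory.c for the real logic. */
    static vm_fault_t sketch_do_fault(struct vm_fault *vmf)
    {
            struct vm_area_struct *vma = vmf->vma;

            if (!(vmf->flags & FAULT_FLAG_WRITE))
                    return do_read_fault(vmf);      /* read access */
            if (!(vma->vm_flags & VM_SHARED))
                    return do_cow_fault(vmf);       /* write, private mapping */
            return do_shared_fault(vmf);            /* write, shared mapping */
    }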

The actual implementation of the workflow is very complex. Its design allows
Linux to handle page faults in a way that is tailored to the specific
characteristics of each architecture, while still sharing a common overall
structure.

To conclude this high altitude view of how Linux handles page faults, let's
add that the page fault handler can be disabled and enabled respectively with
`pagefault_disable()` and `pagefault_enable()`.

Several code paths make use of the latter two functions because they need to
disable traps into the page fault handler, mostly to prevent deadlocks.
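
A typical example of that pattern, simplified from what helpers such as
`copy_from_user_nofault()` do internally, wraps an atomic user copy so that a
missing translation makes the copy fail instead of trapping into the page
fault handler::

    #include <linux/uaccess.h>

    /* Sketch of the usual pagefault_disable()/pagefault_enable() pairing. */
    static long read_user_nofault(void *dst, const void __user *src, size_t size)
    {
            long ret;

            pagefault_disable();
            ret = __copy_from_user_inatomic(dst, src, size);  /* must not sleep */
            pagefault_enable();

            return ret ? -EFAULT : 0;   /* page not present: fail, don't fault in */
    }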