Jeff Xu's implementation of the mseal() syscall.

Merge tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more mm updates from Andrew Morton:
 "Jeff Xu's implementation of the mseal() syscall"

* tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  selftest mm/mseal read-only elf memory segment
  mseal: add documentation
  selftest mm/mseal memory sealing
  mseal: add mseal syscall
  mseal: wire up mseal syscall
This commit is contained in commit 0b32d436c0.

@@ -20,6 +20,7 @@ System calls
    futex2
    ebpf/index
    ioctl/index
+   mseal

 Security-related interfaces
 ===========================
@@ -0,0 +1,199 @@
.. SPDX-License-Identifier: GPL-2.0

=====================
Introduction of mseal
=====================

:Author: Jeff Xu <jeffxu@chromium.org>

Modern CPUs support memory permissions such as RW and NX bits. The memory
permission feature improves the security stance against memory corruption
bugs: an attacker can't simply write to arbitrary memory and point the code
to it; the memory has to be marked with the X bit, or else an exception
will happen.

Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees,
since read-only memory that is supposed to be trusted can become writable,
or .text pages can get remapped. Memory sealing can be applied
automatically by the runtime loader to seal .text and .rodata pages, and
applications can additionally seal security-critical data at runtime.

A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT flag [1] and in OpenBSD with the mimmutable() syscall [2].
User API
========

mseal()
-------
The mseal() syscall has the following signature:

``int mseal(void *addr, size_t len, unsigned long flags)``

**addr/len**: virtual memory address range.

The address range set by ``addr``/``len`` must meet:

- The start address must be in an allocated VMA.
- The start address must be page aligned.
- The end address (``addr`` + ``len``) must be in an allocated VMA.
- There must be no gap (unallocated memory) between the start and end
  addresses.

The ``len`` will be page aligned implicitly by the kernel.

**flags**: reserved for future use.

**return values**:

- ``0``: Success.

- ``-EINVAL``:

  - Invalid input ``flags``.
  - The start address (``addr``) is not page aligned.
  - The address range (``addr`` + ``len``) overflows.

- ``-ENOMEM``:

  - The start address (``addr``) is not allocated.
  - The end address (``addr`` + ``len``) is not allocated.
  - There is a gap (unallocated memory) between the start and end
    addresses.

- ``-EPERM``:

  - Sealing is supported only on 64-bit CPUs; 32-bit is not supported.

- For the above error cases, users can expect the given memory range to
  remain unmodified, i.e. no partial update.

- There might be other internal errors/cases not listed here, e.g. an
  error during merging/splitting VMAs, or the process reaching the maximum
  number of supported VMAs. In those cases, partial updates to the given
  memory range could happen. However, those cases should be rare.
**Blocked operations after sealing**:

- Unmapping, moving to another location, and shrinking the size, via
  munmap() and mremap(): these can leave an empty space, which can then
  be replaced with a VMA having a new set of attributes.

- Moving or expanding a different VMA into the current location, via
  mremap().

- Modifying a VMA via mmap(MAP_FIXED).

- Size expansion via mremap(). This does not appear to pose any specific
  risks to sealed VMAs. It is included anyway because the use case is
  unclear. In any case, users can rely on merging to expand a sealed VMA.

- mprotect() and pkey_mprotect().

- Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
  memory, when users don't have write permission to the memory. Those
  behaviors can alter region contents by discarding pages, effectively a
  memset(0) for anonymous memory.

The kernel will return -EPERM for blocked operations.

For blocked operations, one can expect the given address range to remain
unmodified, i.e. no partial update. Note that this is different from the
behavior of existing mm system calls, where partial updates are made until
an error is found and returned to userspace. To give an example:

Assume the following code sequence:

- ptr = mmap(NULL, 8192, PROT_NONE, ...);
- munmap(ptr + 4096, 4096);
- ret1 = mprotect(ptr, 8192, PROT_READ);
- mseal(ptr, 4096, 0);
- ret2 = mprotect(ptr, 8192, PROT_NONE);

ret1 will be -ENOMEM; the page at ptr is nevertheless updated to PROT_READ.

ret2 will be -EPERM; the page remains PROT_READ.
**Note**:

- mseal() only works on 64-bit CPUs, not 32-bit CPUs.

- Users can call mseal() multiple times; calling mseal() on already sealed
  memory is a no-op (not an error).

- munseal() is not supported.
Use cases
=========

- glibc:
  The dynamic linker, while loading ELF executables, can apply sealing to
  non-writable memory segments.

- Chrome browser: protect some security-sensitive data structures.
Notes on which memory to seal
=============================

It might be important to note that sealing changes the lifetime of a
mapping, i.e. a sealed mapping won't be unmapped until the process
terminates or the exec system call is invoked. Applications can apply
sealing to any virtual memory region from userspace, but it is crucial to
thoroughly analyze the mapping's lifetime before applying the sealing.

For example:

- aio/shm

  aio/shm can call mmap()/munmap() on behalf of userspace, e.g.
  ksys_shmdt() in shm.c. The lifetime of those mappings is not tied to the
  lifetime of the process. If those memories are sealed from userspace,
  then munmap() will fail, causing leaks in the VMA address space during
  the lifetime of the process.

- Brk (heap)

  Currently, userspace applications can seal parts of the heap by calling
  malloc() and mseal(). Let's assume the following calls from user space:

  - ptr = malloc(size);
  - mprotect(ptr, size, RO);
  - mseal(ptr, size);
  - free(ptr);

  Technically, before mseal() was added, the user could change the
  protection of the heap by calling mprotect(RO). As long as the user
  changed the protection back to RW before free(), the memory range could
  be reused.

  With mseal() in the picture, however, the heap is then sealed partially:
  the user can still free it, but the memory remains RO. If the address is
  reused by the heap manager for another malloc, the process might crash
  soon after. Therefore, it is important not to apply sealing to any memory
  that might get recycled.

  Furthermore, even if the application never calls free() on the ptr, the
  heap manager may invoke the brk system call to shrink the size of the
  heap. In the kernel, the brk-shrink will call munmap(). Consequently,
  depending on the location of the ptr, the outcome of brk-shrink is
  nondeterministic.
Additional notes
================

As Jann Horn pointed out in [3], there are still a few ways to write to RO
memory, which is, in a way, by design. Those cases are not covered by
mseal(). If applications want to block such cases, sandbox tools (such as
seccomp, LSM, etc.) might be considered.

Those cases are:

- Write to read-only memory through the /proc/self/mem interface.
- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
- userfaultfd.

The idea that inspired this patch comes from Stephen Röttger's work in V8
CFI [4]. The Chrome browser in ChromeOS will be the first user of this API.

Reference
=========
[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274

[2] https://man.openbsd.org/mimmutable.2

[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com

[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
@@ -501,3 +501,4 @@
 569	common	lsm_get_self_attr	sys_lsm_get_self_attr
 570	common	lsm_set_self_attr	sys_lsm_set_self_attr
 571	common	lsm_list_modules	sys_lsm_list_modules
+572	common	mseal			sys_mseal

@@ -475,3 +475,4 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal

@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls	(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END	(__ARM_NR_COMPAT_BASE + 0x800)

-#define __NR_compat_syscalls	462
+#define __NR_compat_syscalls	463
 #endif

 #define __ARCH_WANT_SYS_CLONE

@@ -929,6 +929,8 @@ __SYSCALL(__NR_lsm_get_self_attr, sys_lsm_get_self_attr)
 __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
+#define __NR_mseal 462
+__SYSCALL(__NR_mseal, sys_mseal)

 /*
  * Please add new compat syscalls above this comment and update

@@ -461,3 +461,4 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal

@@ -467,3 +467,4 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal

@@ -400,3 +400,4 @@
 459	n32	lsm_get_self_attr	sys_lsm_get_self_attr
 460	n32	lsm_set_self_attr	sys_lsm_set_self_attr
 461	n32	lsm_list_modules	sys_lsm_list_modules
+462	n32	mseal			sys_mseal

@@ -376,3 +376,4 @@
 459	n64	lsm_get_self_attr	sys_lsm_get_self_attr
 460	n64	lsm_set_self_attr	sys_lsm_set_self_attr
 461	n64	lsm_list_modules	sys_lsm_list_modules
+462	n64	mseal			sys_mseal

@@ -449,3 +449,4 @@
 459	o32	lsm_get_self_attr	sys_lsm_get_self_attr
 460	o32	lsm_set_self_attr	sys_lsm_set_self_attr
 461	o32	lsm_list_modules	sys_lsm_list_modules
+462	o32	mseal			sys_mseal

@@ -460,3 +460,4 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal

@@ -548,3 +548,4 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal

@@ -464,3 +464,4 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal		sys_mseal

@@ -464,3 +464,4 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal

@@ -507,3 +507,4 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal

@@ -466,3 +466,4 @@
 459	i386	lsm_get_self_attr	sys_lsm_get_self_attr
 460	i386	lsm_set_self_attr	sys_lsm_set_self_attr
 461	i386	lsm_list_modules	sys_lsm_list_modules
+462	i386	mseal			sys_mseal

@@ -383,6 +383,7 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal

 #
 # Due to a historical design error, certain syscalls are numbered differently

@@ -432,3 +432,4 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	mseal			sys_mseal
@@ -821,6 +821,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
			unsigned long prot, unsigned long pgoff,
			unsigned long flags);
+asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned long flags);
 asmlinkage long sys_mbind(unsigned long start, unsigned long len,
			unsigned long mode,
			const unsigned long __user *nmask,
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)

+#define __NR_mseal 462
+__SYSCALL(__NR_mseal, sys_mseal)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463

 /*
  * 32 bit systems traditionally used different
@@ -196,6 +196,7 @@ COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
 COND_SYSCALL(cachestat);
+COND_SYSCALL(mseal);

 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
@@ -43,6 +43,10 @@ ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
 endif

+ifdef CONFIG_64BIT
+mmu-$(CONFIG_MMU)	+= mseal.o
+endif
+
 obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
			maccess.o page-writeback.o folio-compat.o \
			readahead.o swap.o truncate.o vmscan.o shrinker.o \
@@ -1435,6 +1435,43 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
 unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
			  int priority);

+#ifdef CONFIG_64BIT
+/* VM is sealed, in vm_flags */
+#define VM_SEALED	_BITUL(63)
+#endif
+
+#ifdef CONFIG_64BIT
+static inline int can_do_mseal(unsigned long flags)
+{
+	if (flags)
+		return -EINVAL;
+
+	return 0;
+}
+
+bool can_modify_mm(struct mm_struct *mm, unsigned long start,
+		unsigned long end);
+bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start,
+		unsigned long end, int behavior);
+#else
+static inline int can_do_mseal(unsigned long flags)
+{
+	return -EPERM;
+}
+
+static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start,
+		unsigned long end)
+{
+	return true;
+}
+
+static inline bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start,
+		unsigned long end, int behavior)
+{
+	return true;
+}
+#endif
+
 #ifdef CONFIG_SHRINKER_DEBUG
 static inline __printf(2, 0) int shrinker_debugfs_name_alloc(
	struct shrinker *shrinker, const char *fmt, va_list ap)
mm/madvise.c (+12 lines)
@@ -1401,6 +1401,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  -EIO    - an I/O error occurred while paging in data.
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
+ *  -EPERM  - memory is sealed.
  */
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {

@@ -1444,6 +1445,15 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
	start = untagged_addr_remote(mm, start);
	end = start + len;

+	/*
+	 * Check if the address range is sealed for do_madvise().
+	 * can_modify_mm_madv assumes we have acquired the lock on MM.
+	 */
+	if (unlikely(!can_modify_mm_madv(mm, start, end, behavior))) {
+		error = -EPERM;
+		goto out;
+	}
+
	blk_start_plug(&plug);
	switch (behavior) {
	case MADV_POPULATE_READ:

@@ -1456,6 +1466,8 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
		break;
	}
	blk_finish_plug(&plug);
+
+out:
	if (write)
		mmap_write_unlock(mm);
	else
mm/mmap.c (+31 lines)
@@ -1255,6 +1255,16 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
	if (mm->map_count > sysctl_max_map_count)
		return -ENOMEM;

+	/*
+	 * addr is returned from get_unmapped_area.
+	 * There are two cases:
+	 * 1> MAP_FIXED == false
+	 *	unallocated memory, no need to check sealing.
+	 * 2> MAP_FIXED == true
+	 *	sealing is checked inside mmap_region when
+	 *	do_vmi_munmap is called.
+	 */
+
	if (prot == PROT_EXEC) {
		pkey = execute_only_pkey(mm);
		if (pkey < 0)

@@ -2727,6 +2737,14 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
	if (end == start)
		return -EINVAL;

+	/*
+	 * Check if memory is sealed before arch_unmap.
+	 * Prevent unmapping a sealed VMA.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (unlikely(!can_modify_mm(mm, start, end)))
+		return -EPERM;
+
	/* arch_unmap() might do unmaps itself. */
	arch_unmap(mm, start, end);

@@ -2789,7 +2807,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
	}

	/* Unmap any existing mapping in the area */
-	if (do_vmi_munmap(&vmi, mm, addr, len, uf, false))
+	error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
+	if (error == -EPERM)
+		return error;
+	else if (error)
		return -ENOMEM;

	/*

@@ -3139,6 +3160,14 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 {
	struct mm_struct *mm = vma->vm_mm;

+	/*
+	 * Check if memory is sealed before arch_unmap.
+	 * Prevent unmapping a sealed VMA.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (unlikely(!can_modify_mm(mm, start, end)))
+		return -EPERM;
+
	arch_unmap(mm, start, end);
	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
 }
@@ -32,6 +32,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/memory-tiers.h>
+#include <uapi/linux/mman.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>

@@ -744,6 +745,15 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
		}
	}

+	/*
+	 * Checking if memory is sealed.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (unlikely(!can_modify_mm(current->mm, start, end))) {
+		error = -EPERM;
+		goto out;
+	}
+
	prev = vma_prev(&vmi);
	if (start > vma->vm_start)
		prev = vma;
mm/mremap.c (+31 lines)
@@ -902,7 +902,25 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
		return -ENOMEM;

+	/*
+	 * In mremap_to().
+	 * Move a VMA to another location, check if src addr is sealed.
+	 *
+	 * Place can_modify_mm here because mremap_to()
+	 * does its own checking for address range, and we only
+	 * check the sealing after passing those checks.
+	 *
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (unlikely(!can_modify_mm(mm, addr, addr + old_len)))
+		return -EPERM;
+
	if (flags & MREMAP_FIXED) {
+		/*
+		 * In mremap_to().
+		 * VMA is moved to dst address, and munmap dst first.
+		 * do_munmap will check if dst is sealed.
+		 */
		ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
		if (ret)
			goto out;

@@ -1061,6 +1079,19 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
		goto out;
	}

+	/*
+	 * Below is the shrink/expand case (not mremap_to()).
+	 * Check if the src address is sealed; if so, reject.
+	 * In other words, prevent shrinking or expanding a sealed VMA.
+	 *
+	 * Place can_modify_mm here so we can keep the logic related to
+	 * shrink/expand together.
+	 */
+	if (unlikely(!can_modify_mm(mm, addr, addr + old_len))) {
+		ret = -EPERM;
+		goto out;
+	}
+
	/*
	 * Always allow a shrinking remap: that just unmaps
	 * the unnecessary pages..
@@ -0,0 +1,307 @@
// SPDX-License-Identifier: GPL-2.0
/*
 *  Implement mseal() syscall.
 *
 *  Copyright (c) 2023,2024 Google, Inc.
 *
 *  Author: Jeff Xu <jeffxu@chromium.org>
 */

#include <linux/mempolicy.h>
#include <linux/mman.h>
#include <linux/mm.h>
#include <linux/mm_inline.h>
#include <linux/mmu_context.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
#include "internal.h"

static inline bool vma_is_sealed(struct vm_area_struct *vma)
{
	return (vma->vm_flags & VM_SEALED);
}

static inline void set_vma_sealed(struct vm_area_struct *vma)
{
	vm_flags_set(vma, VM_SEALED);
}

/*
 * Check if a vma is sealed for modification.
 * Return true if modification is allowed.
 */
static bool can_modify_vma(struct vm_area_struct *vma)
{
	if (unlikely(vma_is_sealed(vma)))
		return false;

	return true;
}

static bool is_madv_discard(int behavior)
{
	return	behavior &
		(MADV_FREE | MADV_DONTNEED | MADV_DONTNEED_LOCKED |
		 MADV_REMOVE | MADV_DONTFORK | MADV_WIPEONFORK);
}

static bool is_ro_anon(struct vm_area_struct *vma)
{
	/* check anonymous mapping. */
	if (vma->vm_file || vma->vm_flags & VM_SHARED)
		return false;

	/*
	 * check for non-writable:
	 * PROT=RO or PKRU is not writeable.
	 */
	if (!(vma->vm_flags & VM_WRITE) ||
		!arch_vma_access_permitted(vma, true, false, false))
		return true;

	return false;
}

/*
 * Check if the vmas of a memory range are allowed to be modified.
 * The memory range can have a gap (unallocated memory).
 * Return true if it is allowed.
 */
bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end)
{
	struct vm_area_struct *vma;

	VMA_ITERATOR(vmi, mm, start);

	/* going through each vma to check. */
	for_each_vma_range(vmi, vma, end) {
		if (unlikely(!can_modify_vma(vma)))
			return false;
	}

	/* Allow by default. */
	return true;
}

/*
 * Check if the vmas of a memory range are allowed to be modified by madvise.
 * The memory range can have a gap (unallocated memory).
 * Return true if it is allowed.
 */
bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, unsigned long end,
		int behavior)
{
	struct vm_area_struct *vma;

	VMA_ITERATOR(vmi, mm, start);

	if (!is_madv_discard(behavior))
		return true;

	/* going through each vma to check. */
	for_each_vma_range(vmi, vma, end)
		if (unlikely(is_ro_anon(vma) && !can_modify_vma(vma)))
			return false;

	/* Allow by default. */
	return true;
}

static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
		struct vm_area_struct **prev, unsigned long start,
		unsigned long end, vm_flags_t newflags)
{
	int ret = 0;
	vm_flags_t oldflags = vma->vm_flags;

	if (newflags == oldflags)
		goto out;

	vma = vma_modify_flags(vmi, *prev, vma, start, end, newflags);
	if (IS_ERR(vma)) {
		ret = PTR_ERR(vma);
		goto out;
	}

	set_vma_sealed(vma);
out:
	*prev = vma;
	return ret;
}

/*
 * Check for do_mseal:
 * 1> start is part of a valid vma.
 * 2> end is part of a valid vma.
 * 3> No gap (unallocated address) between start and end.
 * 4> map is sealable.
 */
static int check_mm_seal(unsigned long start, unsigned long end)
{
	struct vm_area_struct *vma;
	unsigned long nstart = start;

	VMA_ITERATOR(vmi, current->mm, start);

	/* going through each vma to check. */
	for_each_vma_range(vmi, vma, end) {
		if (vma->vm_start > nstart)
			/* unallocated memory found. */
			return -ENOMEM;

		if (vma->vm_end >= end)
			return 0;

		nstart = vma->vm_end;
	}

	return -ENOMEM;
}

/*
 * Apply sealing.
 */
static int apply_mm_seal(unsigned long start, unsigned long end)
{
	unsigned long nstart;
	struct vm_area_struct *vma, *prev;

	VMA_ITERATOR(vmi, current->mm, start);

	vma = vma_iter_load(&vmi);
	/*
	 * Note: check_mm_seal should have already checked the ENOMEM case,
	 * so vma should not be NULL; same for the other ENOMEM cases.
	 */
	prev = vma_prev(&vmi);
	if (start > vma->vm_start)
		prev = vma;

	nstart = start;
	for_each_vma_range(vmi, vma, end) {
		int error;
		unsigned long tmp;
		vm_flags_t newflags;

		newflags = vma->vm_flags | VM_SEALED;
		tmp = vma->vm_end;
		if (tmp > end)
			tmp = end;
		error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
		if (error)
			return error;
		nstart = vma_iter_end(&vmi);
	}

	return 0;
}

/*
 * mseal(2) seals the VM's metadata from selected syscalls.
 *
 * addr/len: VM address range.
 *
 *  The address range by addr/len must meet:
 *   start (addr) must be in a valid VMA.
 *   end (addr + len) must be in a valid VMA.
 *   no gap (unallocated memory) between start and end.
 *   start (addr) must be page aligned.
 *
 *  len: len will be page aligned implicitly.
 *
 *   Below VMA operations are blocked after sealing.
 *   1> Unmapping, moving to another location, and shrinking
 *	the size, via munmap() and mremap(), can leave an empty
 *	space, therefore can be replaced with a VMA with a new
 *	set of attributes.
 *   2> Moving or expanding a different vma into the current location,
 *	via mremap().
 *   3> Modifying a VMA via mmap(MAP_FIXED).
 *   4> Size expansion, via mremap(), does not appear to pose any
 *	specific risks to sealed VMAs. It is included anyway because
 *	the use case is unclear. In any case, users can rely on
 *	merging to expand a sealed VMA.
 *   5> mprotect and pkey_mprotect.
 *   6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED)
 *	for anonymous memory, when users don't have write permission to the
 *	memory. Those behaviors can alter region contents by discarding pages,
 *	effectively a memset(0) for anonymous memory.
 *
 *  flags: reserved.
 *
 * return values:
 *  zero: success.
 *  -EINVAL:
 *   invalid input flags.
 *   start address is not page aligned.
 *   Address range (start + len) overflow.
 *  -ENOMEM:
 *   addr is not a valid address (not allocated).
 *   end (start + len) is not a valid address.
 *   a gap (unallocated memory) between start and end.
 *  -EPERM:
 *   - On a 32-bit architecture, sealing is not supported.
 * Note:
 *  user can call mseal(2) multiple times; adding a seal on
 *  already sealed memory is a no-op (no error).
 *
 *  unseal() is not supported.
 */
static int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
{
	size_t len;
	int ret = 0;
	unsigned long end;
	struct mm_struct *mm = current->mm;

	ret = can_do_mseal(flags);
	if (ret)
		return ret;

	start = untagged_addr(start);
	if (!PAGE_ALIGNED(start))
		return -EINVAL;

	len = PAGE_ALIGN(len_in);
	/* Check to see whether len was rounded up from small -ve to zero. */
	if (len_in && !len)
		return -EINVAL;

	end = start + len;
	if (end < start)
		return -EINVAL;

	if (end == start)
		return 0;

	if (mmap_write_lock_killable(mm))
		return -EINTR;

	/*
	 * First pass: this helps to avoid partial sealing in case of an
	 * error in the input address range, e.g. an ENOMEM error.
	 */
	ret = check_mm_seal(start, end);
	if (ret)
		goto out;

	/*
	 * Second pass: this should succeed unless there are errors from
	 * vma_modify_flags, e.g. a merge/split error, or the process
	 * reaching the maximum number of supported VMAs; however, those
	 * cases should be rare.
	 */
	ret = apply_mm_seal(start, end);

out:
	mmap_write_unlock(current->mm);
	return ret;
}

SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
		flags)
{
	return do_mseal(start, len, flags);
}
@@ -47,3 +47,5 @@ mkdirty
 va_high_addr_switch
 hugetlb_fault_after_madv
 hugetlb_madv_vs_map
+mseal_test
+seal_elf
@@ -59,6 +59,8 @@ TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mrelease_test
 TEST_GEN_FILES += mremap_dontunmap
 TEST_GEN_FILES += mremap_test
+TEST_GEN_FILES += mseal_test
+TEST_GEN_FILES += seal_elf
 TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += pagemap_ioctl
 TEST_GEN_FILES += thuge-gen
File diff suppressed because it is too large
@@ -0,0 +1,179 @@
// SPDX-License-Identifier: GPL-2.0
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdbool.h>
#include "../kselftest.h"
#include <syscall.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/vfs.h>
#include <sys/stat.h>

/*
 * need those definitions for a manual build using gcc:
 * gcc -I ../../../../usr/include -DDEBUG -O3 seal_elf.c -o seal_elf
 */
#define FAIL_TEST_IF_FALSE(c) do {\
		if (!(c)) {\
			ksft_test_result_fail("%s, line:%d\n", __func__, __LINE__);\
			goto test_end;\
		} \
	} \
	while (0)

#define SKIP_TEST_IF_FALSE(c) do {\
		if (!(c)) {\
			ksft_test_result_skip("%s, line:%d\n", __func__, __LINE__);\
			goto test_end;\
		} \
	} \
	while (0)


#define TEST_END_CHECK() {\
		ksft_test_result_pass("%s\n", __func__);\
		return;\
test_end:\
		return;\
}

#ifndef u64
#define u64 unsigned long long
#endif

/*
 * define sys_xyz() helpers to invoke the syscalls directly.
 */
static int sys_mseal(void *start, size_t len)
{
	int sret;

	errno = 0;
	sret = syscall(__NR_mseal, start, len, 0);
	return sret;
}

static void *sys_mmap(void *addr, unsigned long len, unsigned long prot,
		unsigned long flags, unsigned long fd, unsigned long offset)
{
	void *sret;

	errno = 0;
	sret = (void *) syscall(__NR_mmap, addr, len, prot,
			flags, fd, offset);
	return sret;
}

static inline int sys_mprotect(void *ptr, size_t size, unsigned long prot)
{
	int sret;

	errno = 0;
	sret = syscall(__NR_mprotect, ptr, size, prot);
	return sret;
}

static bool seal_support(void)
{
	int ret;
	void *ptr;
	unsigned long page_size = getpagesize();

	ptr = sys_mmap(NULL, page_size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (ptr == (void *) -1)
		return false;

	ret = sys_mseal(ptr, page_size);
	if (ret < 0)
		return false;

	return true;
}

const char somestr[4096] = {"READONLY"};

static void test_seal_elf(void)
{
	int ret;
	FILE *maps;
	char line[512];
	uintptr_t addr_start, addr_end;
	char prot[5];
	char filename[256];
	unsigned long page_size = getpagesize();
	unsigned long long ptr = (unsigned long long) somestr;
	char *somestr2 = (char *)somestr;

	/*
	 * Modify the protection of the read-only somestr.
	 */
	if (((unsigned long long)ptr % page_size) != 0)
		ptr = (unsigned long long)ptr & ~(page_size - 1);

	ksft_print_msg("somestr = %s\n", somestr);
	ksft_print_msg("change protection to rw\n");
	ret = sys_mprotect((void *)ptr, page_size, PROT_READ|PROT_WRITE);
	FAIL_TEST_IF_FALSE(!ret);
	*somestr2 = 'A';
	ksft_print_msg("somestr is modified to: %s\n", somestr);
	ret = sys_mprotect((void *)ptr, page_size, PROT_READ);
	FAIL_TEST_IF_FALSE(!ret);

	maps = fopen("/proc/self/maps", "r");
	FAIL_TEST_IF_FALSE(maps);

	/*
	 * apply sealing to the elf binary's mappings
	 */
	while (fgets(line, sizeof(line), maps)) {
		if (sscanf(line, "%lx-%lx %4s %*x %*x:%*x %*u %255[^\n]",
			&addr_start, &addr_end, prot, filename) == 4) {
			if (strlen(filename)) {
				/*
				 * seal the mapping if read only.
				 */
				if (strstr(prot, "r-")) {
					ret = sys_mseal((void *)addr_start, addr_end - addr_start);
					FAIL_TEST_IF_FALSE(!ret);
					ksft_print_msg("sealed: %lx-%lx %s %s\n",
						addr_start, addr_end, prot, filename);
					if ((uintptr_t) somestr >= addr_start &&
							(uintptr_t) somestr <= addr_end)
						ksft_print_msg("mapping for somestr found\n");
				}
			}
		}
	}
	fclose(maps);

	ret = sys_mprotect((void *)ptr, page_size, PROT_READ | PROT_WRITE);
	FAIL_TEST_IF_FALSE(ret < 0);
	ksft_print_msg("somestr is sealed, mprotect is rejected\n");

	TEST_END_CHECK();
}

int main(int argc, char **argv)
{
	bool test_seal = seal_support();

	ksft_print_header();
	ksft_print_msg("pid=%d\n", getpid());

	if (!test_seal)
		ksft_exit_skip("sealing not supported, check CONFIG_64BIT\n");

	ksft_set_plan(1);

	test_seal_elf();

	ksft_finished();
}