I am running a laptop with Arch Linux, kernel 6.18.5, a relatively low 8 GB of RAM, and 4 GB of swap. When a Linux system runs out of memory, it sometimes freezes: it stops responding to keyboard shortcuts and the cursor stops moving. With luck, the OOM killer eventually kills some process like Firefox and some memory is freed, but sometimes the whole system freezes for a long time and has to be reset.
As far as I understand, this happens because when Linux runs out of memory, it evicts file-backed pages, including the executable code of running programs, which is backed by their ELF files. In addition, data pages not backed by files, called anonymous pages, are written out to swap. When the system runs out of memory, the executable pages of important programs that drive the UI are constantly evicted and reloaded from their ELF files.
The problem of page eviction and reloading slowing down the system is not specific to Linux; it is known as thrashing.
In theory, malloc should fail in this situation: it requests memory from the kernel via mmap, and the kernel should refuse to hand out another page when no memory is left. In practice, Linux overcommits memory: allocations succeed immediately, and physical pages are only assigned when the memory is first written.
One perspective on this is the article
"What is Overcommit? And why is it bad?" written by musl libc author,
followed up by a blog post To overcommit or not to overcommit.
It argues that overcommit is bad because it encourages writing programs that don't handle memory allocation errors well.
You can disable overcommit by running sysctl vm.overcommit_memory=2.
The value 2 corresponds to OVERCOMMIT_NEVER in include/uapi/linux/mman.h.
I tried it, and it immediately killed waybar, firefox, three alacritty terminals,
gdbus, emacs (running in the background for emacsclient), and vim (in which I was writing this),
because apparently far more memory had already been virtually allocated than the system has.
After restarting everything, I tried to open Firefox, terminal with vim,
Electron-based Delta Chat Desktop, and then waybar crashed.
In dmesg this line appeared:
__vm_enough_memory: pid: 2564294, comm: waybar, bytes: 4096 not enough memory for the allocation.
So nothing was killed by the OOM killer; waybar crashed because it does not handle memory allocation errors.
Overall, more than 10 years after the posts mentioned above, it is still true that programs allocate far more memory than they need and crash on allocation errors, even when their only job is displaying the clock, sound volume, Wi-Fi network name, and currently selected workspace. I don't understand why this happens or what can be done to fix it. It could be that Firefox does its own memory management and e.g. pre-allocates huge arenas in advance, in which case the problem can only be fixed in Firefox itself. It could also be the behavior of the malloc implementation, in which case simply swapping malloc system-wide for one that does not allocate writable pages in advance could help.
It is also possible to increase vm.overcommit_ratio from the default value of 50,
or vm.overcommit_kbytes from the default value of 0, so that memory allocation
starts failing later and user processes can commit more than 8 GB (swap size + 50% of RAM),
but this cannot solve the problem of software that does not handle memory allocation errors.
I reverted to sysctl vm.overcommit_memory=0 by rebooting.
One piece of bad advice that I immediately dismissed is disabling swap. Disabling swap does not solve the thrashing problem, because Linux will still evict file-backed executable pages, e.g. those of the Wayland compositor responsible for the UI, and it makes the out-of-memory problem worse, as unused anonymous data can no longer be offloaded to swap.
I have also seen recommendations to enable zswap or zram.
It turns out I already have zswap enabled
as reading from /sys/module/zswap/parameters/enabled
returns Y.
It might still be worth understanding swap and the options around it,
such as vm.swappiness, but I have not looked
into them, as they do not directly solve the problem.
One manual solution is to lock the memory pages of important programs
that are responsible for the UI.
One example of the program that does this is
memlockd
which works by mapping important ELF files into its own memory
and calling mlock() on them.
The program was originally designed to lock the SSH daemon and a shell into memory,
so that you can log into an out-of-memory server,
but it can also be used on a desktop if you lock your UI,
such as the Wayland compositor or Xorg,
together with the necessary dynamically linked libraries.
memlockd can also recursively lock all shared libraries an executable depends on
if you prefix its config line with +.
I have previously used it to lock Xorg and dwm into memory on another system, and it seemed to prevent the cursor from freezing.
I did not set it up this time, though: memlockd is not directly available in the Arch Linux package repositories, and it does not look like a proper solution, as it only locks executables (data pages may still get swapped out) and it locks unused shared libraries into memory.
There are several user-space daemons that attempt to detect thrashing or out-of-memory situations early and kill processes before the kernel OOM killer is triggered; the OOM killer sometimes takes a long time to kill anything, and the system stays frozen in the meantime. systemd-oomd turned out to be preinstalled on my system, but not enabled. It can use swap usage and PSI (Pressure Stall Information) to detect thrashing.
Pressure Stall Information, added in Linux 4.20 in 2018, exports a number of files into /proc/pressure/.
Most interesting is /proc/pressure/memory, as it allows detecting thrashing
by tracking the time all active processes spend waiting for page faults to be processed.
This is what systemd-oomd uses.
Older projects created before the introduction of PSI, such as earlyoom, cannot detect thrashing directly.
According to the earlyoom readme, it takes a very simple approach: it polls free memory and swap
and kills the process with the highest oom_score.
This does not look reliable, as earlyoom has no way to know how much memory is actually in use.
If I were to set up such a userspace daemon, I would enable systemd-oomd.
It looks like systemd-oomd needs some configuration to define
which processes it may kill; just enabling it is not enough.
I have also found reports, from the time Fedora enabled it by default as a replacement for earlyoom, of it killing the whole session.
Apparently systemd-oomd selects the victim not by oom_score but by memory pressure,
so when the system runs out of memory, the process suffering from the pressure, like the Wayland compositor,
gets killed instead of the process that caused the shortage, like a web browser or an Electron app.
Maybe it can be configured properly, at least by manually excluding important programs,
but I start my Wayland session manually from a TTY by running exec ssh-agent niri-session,
not from systemd, so I have postponed enabling systemd-oomd for now.
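For reference, the kind of configuration it expects (as I understand from systemd.resource-control(5)) is a ManagedOOM*= property on a cgroup unit, e.g. a drop-in for the user manager service; the path and the 50% pressure limit here are arbitrary examples, not recommendations:

```ini
# /etc/systemd/system/user@.service.d/10-oomd.conf (hypothetical drop-in)
[Service]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
```

Fedora ships drop-ins along these lines; whether this helps a session started manually from a TTY, outside systemd's user units, is another question.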
Multi-generational LRU
is a patchset changing the algorithm that determines which pages to reclaim.
It has been in development since 2021
and, after some revisions, was merged into Linux 6.1 in 2022.
It is enabled on my system as reading from /sys/kernel/mm/lru_gen/enabled returns 0x0007.
One feature of multi-gen LRU is built-in thrashing prevention.
According to the documentation, it triggers the kernel OOM killer if the "working set" (the pages used within a configurable number of recent milliseconds) cannot be kept in memory.
This sounds like a solution combining the advantages of systemd-oomd and earlyoom
as it is triggered by the measurement built directly into the paging system that reacts to thrashing (unlike earlyoom that measures used memory)
and kills the processes according to OOM score (unlike systemd-oomd which kills cgroup with the highest memory pressure).
It can be enabled by running echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms as root.
To persist this change, I created /etc/systemd/system/mglru.service with the following contents:
[Unit]
Description=Enable Multi-Gen LRU and thrashing prevention
ConditionPathExists=/sys/kernel/mm/lru_gen/enabled
Documentation=https://www.kernel.org/doc/html/v6.18/admin-guide/mm/multigen_lru.html

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo y >/sys/kernel/mm/lru_gen/enabled; echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms"

[Install]
WantedBy=default.target
Then I ran systemctl daemon-reload and systemctl enable --now mglru.service.
I have not tested this extensively, but running a compilation resulted in electron (Delta Chat Desktop, started from Alacritty) being killed early, without the system freezing.
These lines appeared at the end of dmesg output:
[38238.240642] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-niri-alacritty-2877.scope,task=electron,pid=3047,uid=1000
[38238.240779] Out of memory: Killed process 3047 (electron) total-vm:1459706252kB, anon-rss:118356kB, file-rss:7380kB, shmem-rss:508kB, UID:1000 pgtables:2224kB oom_score_adj:300
I then discovered that there is an easier way to manage /sys/ configuration
by creating a file in tmpfiles.d
and letting systemd-tmpfiles manage it.
I removed /etc/systemd/system/mglru.service and created /etc/tmpfiles.d/mglru.conf
with the following contents:
#Type Path                               Mode User Group Age Argument
w-    /sys/kernel/mm/lru_gen/enabled     -    -    -     -   y
w-    /sys/kernel/mm/lru_gen/min_ttl_ms  -    -    -     -   1000

This configuration is successfully applied after reboot.
After looking at the current solutions to out-of-memory situations freezing a Linux system, I have enabled the built-in thrashing prevention of multi-gen LRU, and it appears to solve the problem without user-space daemons or extensive configuration.