One of the things that I'm interested in, but don't really understand at this point, is the Linux kernel's out-of-memory killer. In this post, I'll try to fix that!
The out-of-memory killer is a part of the Linux kernel that is used when the system is almost completely out of memory (RAM) and needs to free some up. It does this by killing programs, thus reclaiming the memory that they were using. Having spent a lot of time working on a machine with 512MB of RAM, I'm pretty familiar with the OOM killer getting activated, but I don't really understand how it works. In this post, I'm going to go through the source code for the OOM killer to try to figure out how it works!
First, we need to actually get the source for the Linux kernel. This is located at git.kernel.org. Specifically, we want kernel/git/torvalds/linux.git. The file that contains the OOM killer code is `mm/oom_kill.c`. I think that `mm` stands for "memory management". While I'm writing this, the git commit that I'm looking at is ba6d973f78. Since the file is more than 1000 lines long, we'll just look at a few functions:
I see from scrolling through it that `oom_badness` looks interesting - according to the comment, this determines which process to kill! Let's look at the function signature:
```c
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
			  const nodemask_t *nodemask, unsigned long totalpages)
```
Here's what I get from this:

- The score that it returns is an `unsigned long` - I wonder what its max value is, or if it even has a maximum.
- It takes four arguments:
  - A `task_struct` called `p`, presumably standing for "process". `task_struct` is defined in `include/linux/sched.h:1501`. This contains a ton of stuff! But the important part is that it represents a process.
  - A `mem_cgroup` variable. CGroups in Linux are a way to sandbox different resources - in the case of a memory cgroup, you could limit a specific process or set of processes to, say, only taking up 512MB of RAM. Presumably the memory cgroup that the process is in is taken into account somewhere.
  - `nodemask`: I'm not quite sure what this is.
  - `totalpages`: This is the total amount of RAM that the system has.
Next, we create a couple of variables:
```c
{
	long points;
	long adj;
```
`points` keeps track of the number of 'badness points' that this task has - this value is what we will end up returning.

`adj` is more complicated - but it seems like it has to do with the `oom_score_adj` value. I know that this value exists in `/proc/<PID>/oom_score_adj`, so let's look at `man proc` to figure this out!
```
/proc/[pid]/oom_score_adj (since Linux 2.6.36)
    This file can be used to adjust the badness heuristic used to select which
    process gets killed in out-of-memory conditions.

    The badness heuristic assigns a value to each candidate task ranging from 0
    (never kill) to 1000 (always kill) to determine which process is targeted.
    The units are roughly a proportion along that range of allowed memory the
    process may allocate from, based on an estimation of its current memory and
    swap use. For example, if a task is using all allowed memory, its badness
    score will be 1000. If it is using half of its allowed memory, its score
    will be 500.

    ...

    The value of oom_score_adj is added to the badness score before it is used
    to determine which task to kill. Acceptable values range from -1000
    (OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows user space to
    control the preference for OOM-killing, ranging from always preferring a
    certain task or completely disabling it from OOM killing. The lowest
    possible value, -1000, is equivalent to disabling OOM-killing entirely for
    that task, since it will always report a badness score of 0.

    Consequently, it is very simple for user space to define the amount of
    memory to consider for each task. Setting a oom_score_adj value of +500,
    for example, is roughly equivalent to allowing the remainder of tasks
    sharing the same system, cpuset, mempolicy, or memory controller resources
    to use at least 50% more memory. A value of -500, on the other hand, would
    be roughly equivalent to discounting 50% of the task's allowed memory from
    being considered as scoring against the task.
```
So presumably this value will contain something related to the adjustment value!
Let's continue:
```c
if (oom_unkillable_task(p, memcg, nodemask))
	return 0;
```
Pretty simple - if it's unkillable, return zero (which makes sure that it won't be killed).
```c
p = find_lock_task_mm(p);
```
`find_lock_task_mm` will make sure that the task `p` has a `->mm` member, returning `NULL` if it can't find any subthread with a valid `->mm`.
Returning null causes us to mark the task not to be killed:
```c
if (!p)
	return 0;
```
If we can find a subthread that has a valid `mm` member, we continue on:
```c
/*
 * Do not even consider tasks which are explicitly marked oom
 * unkillable or have been already oom reaped or the are in
 * the middle of vfork
 */
```
A few things to note here:

- It's interesting that being "explicitly marked oom unkillable" isn't checked in `oom_unkillable_task()`. I wonder why that is.
- I'm not quite sure of the reason for not oom-killing tasks that are `vfork`ing. My guess would be that it's because the task that should actually be killed is the parent, but I'm not sure about that.
Let's go on:
```c
adj = (long)p->signal->oom_score_adj;
```
And here we figure out exactly what `adj` represents! It is in fact the oom adjustment value that the man page talks about.
```c
if (adj == OOM_SCORE_ADJ_MIN ||
		test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
		in_vfork(p)) {
	task_unlock(p);
	return 0;
}
```
This if statement implements what the comment says it does - pretty simple. I'm not sure why `task_unlock()` is called in this case - it doesn't seem to have anything to do with this if statement, but there it is!
Moving on to the meat of the 'badness function', we have:
```c
/*
 * The baseline for the badness score is the proportion of RAM that each
 * task's rss, pagetable and swap space use.
 */
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
	atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
task_unlock(p);
```
The line that sets `points` is pretty simple - the number of 'badness points' a process starts with is equal to the sum of three things:

- Its RSS (Resident Set Size). This is a measure of the memory that a process is using. It includes shared libraries, stack, and heap, but not swap.
- Its pagetable use. The page table is how Linux allows every running program to have its own block of memory that appears to be contiguous. Each page in the pagetable takes up some RAM, which is why we're taking it into account.
- Its swap usage. Swap is hard drive space that is used like RAM once the system runs out of RAM. We add this because it isn't included in the RSS.

If you're reading closely, you'll notice that the comment states that we're summing three things, but the statement actually sums four! This is because the page table entries are separated into two things - PTE table pages and PMD table pages. This distinction exists because different architectures handle memory management differently, and Linux chooses to abstract it this way. In any case, we need to add together both the PTE pages and the PMD pages to get the total number of pages.
Next, there's the `task_unlock()` line. This explains why there was a `task_unlock()` in the if statement above - it looks like the task must be unlocked before this function returns, so because the function returned early in that if statement, it had to `task_unlock()` there. I'm still not sure why the task needs to be unlocked, but that's a question for another day!
Let's look at the next part of the code:
```c
/*
 * Root processes get 3% bonus, just like the __vm_enough_memory()
 * implementation used by LSMs.
 */
if (has_capability_noaudit(p, CAP_SYS_ADMIN))
	points -= (points * 3) / 100;
```
The "LSMs" that they refer to are, I think, Linux Security Modules.
This code simply makes processes run by root slightly less likely to be oom-killed.
Moving on, we apply the adjustment from `oom_score_adj`:
```c
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
points += adj;

/*
 * Never return 0 for an eligible task regardless of the root bonus and
 * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
 */
return points > 0 ? points : 1;
}
```
The most complicated part of this is the `adj *= totalpages / 1000`. I think the reason for the `totalpages` here is that the values we assigned to `points` previously were measured in actual memory usage (pages), while the value of `oom_score_adj` is a static ±1000. This calculation converts the value in `adj` into something that takes into account the total amount of RAM available.
The check in the return statement just makes sure that the return value is always 1 or greater. This is interesting, because on my system there are quite a few processes that should be eligible, but have an `oom_score` of 0 (you can find the oom score of a running process with `cat /proc/<PID>/oom_score`). My system may be running a different version of this function, but I'm not sure. A mystery for another day!
Anyway, that's it for now! I like this format for looking at parts of the kernel, since it gives me a lot of jumping off points and other questions to ask! Here's a few questions that I have now:
- What actually calls the oom killer code?
- What exactly is the page table? How is it different on x86 vs ARM?
- How does LSM work? What hooks does it provide? How do those hooks work?
- How does PaX work?
- How does memory management work on a `RT_PREEMPT` system? Does anything about the oom killer change?
- How does RSS compare to other ways of measuring memory taken, like VSZ or USS?
- Why are there `goto`s in the kernel source? Is it just bad practice, or is it faster/better?
- How do spinlocks work? Does the kernel use any other types of locks?
- How does RCU work?
- What happens if pid 1 is the only process, but the oom killer gets called?
All interesting things to learn about!