3 minute read

If you’ve worked with containers before, you’ve probably also set resource limits for them. For example, to set the memory limit for a container in a Kubernetes pod, you’d specify something like:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: myapp
    image: myimage
    resources:
      limits:
        memory: "128Mi"

Have you ever wondered how the operating system makes sure that this particular container doesn’t actually use more than 128Mi of memory? It’s done through a feature of the Linux kernel called control groups, or simply cgroups. Let’s dig deeper into cgroups.

Note: The discussion here is about cgroup v2. Although the core idea is the same, how cgroups are organized is a bit different in cgroup v1.

Control groups allow processes to be put into groups and limit the resource usage (e.g. CPU, memory, network I/O, etc.) for those groups. The information about control groups lives in the /sys/fs/cgroup directory, and cgroups can be listed with the lscgroup command.
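A quick way to confirm that a system is on cgroup v2 is to check the filesystem type mounted at /sys/fs/cgroup; on v2 it is a single cgroup2 mount (the output below is illustrative):

root@sbh:~# stat -fc %T /sys/fs/cgroup
cgroup2fs

If this prints tmpfs instead, the system is still on cgroup v1.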

To see cgroups in action, let’s create a Docker container with a memory limit of 6 MB.

docker run -d --memory 6m nginx

Let’s get the ID of the container:

root@sbh:~# docker ps
CONTAINER ID   IMAGE     COMMAND                  CREATED         STATUS         PORTS     NAMES
79440b327c3c   nginx     "/docker-entrypoint.…"   6 minutes ago   Up 6 minutes   80/tcp    naughty_merkle

It’s 79440b.....
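Note that the cgroup for the container is named after the full, untruncated container ID. If you want to grab that directly, docker inspect can print it (naughty_merkle is the container name from the docker ps output above):

root@sbh:~# docker inspect --format '{{.Id}}' naughty_merkle
79440b327c3c776998dfdc34689e276afbbc9989ea9dc87602acd8848e0f2fb1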

Using lscgroup, let’s look for the control group for this container:

root@sbh:~# lscgroup | grep docker
cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:/system.slice/docker.service
cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:/system.slice/docker.socket
cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:/system.slice/docker-79440b327c3c776998dfdc34689e276afbbc9989ea9dc87602acd8848e0f2fb1.scope

We see that the cgroup for this container is /system.slice/docker-79440b327c3c776998dfdc34689e276afbbc9989ea9dc87602acd8848e0f2fb1.scope. Let’s look inside this hierarchy under /sys/fs/cgroup, where all the cgroups are located, and specifically look for the memory limit. The memory limit, in bytes, is in the file memory.max inside the cgroup directory.

root@sbh:/sys/fs/cgroup# cat system.slice/docker-79440b327c3c776998dfdc34689e276afbbc9989ea9dc87602acd8848e0f2fb1.scope/memory.max 
6291456

It’s precisely 6 MiB (6 × 1024 × 1024 = 6291456 bytes).
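As a cross-check, Docker reports the same number: the limit we passed with --memory is stored in the container’s HostConfig (output shown for illustration):

root@sbh:~# docker inspect --format '{{.HostConfig.Memory}}' 79440b327c3c
6291456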

To make this a bit more transparent, let’s create our own cgroup. To do this, we’ll simply create a directory in the /sys/fs/cgroup directory. Let’s create a demo-cgroup cgroup.

root@sbh:/sys/fs/cgroup# mkdir demo-cgroup

If we check using lscgroup, we can see that the cgroup has been created:

root@sbh:/sys/fs/cgroup/demo-cgroup# lscgroup | grep demo
cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:/demo-cgroup

As soon as we create this directory, the kernel populates the cgroup with default values for all the resource limits. Let’s look inside the directory:

root@sbh:/sys/fs/cgroup# cd demo-cgroup/
root@sbh:/sys/fs/cgroup/demo-cgroup# ls
cgroup.controllers      cgroup.subtree_control  cpu.uclamp.min         hugetlb.1GB.events        hugetlb.2MB.max           memory.current       memory.peak          memory.zswap.current  rdma.current
cgroup.events           cgroup.threads          cpu.weight             hugetlb.1GB.events.local  hugetlb.2MB.numa_stat     memory.events        memory.pressure      memory.zswap.max      rdma.max
cgroup.freeze           cgroup.type             cpu.weight.nice        hugetlb.1GB.max           hugetlb.2MB.rsvd.current  memory.events.local  memory.reclaim       misc.current
cgroup.kill             cpu.idle                cpuset.cpus            hugetlb.1GB.numa_stat     hugetlb.2MB.rsvd.max      memory.high          memory.stat          misc.events
cgroup.max.depth        cpu.max                 cpuset.cpus.effective  hugetlb.1GB.rsvd.current  io.max                    memory.low           memory.swap.current  misc.max
cgroup.max.descendants  cpu.max.burst           cpuset.cpus.partition  hugetlb.1GB.rsvd.max      io.pressure               memory.max           memory.swap.events   pids.current
cgroup.pressure         cpu.pressure            cpuset.mems            hugetlb.2MB.current       io.prio.class             memory.min           memory.swap.high     pids.events
cgroup.procs            cpu.stat                cpuset.mems.effective  hugetlb.2MB.events        io.stat                   memory.numa_stat     memory.swap.max      pids.max
cgroup.stat             cpu.uclamp.max          hugetlb.1GB.current    hugetlb.2MB.events.local  io.weight                 memory.oom.group     memory.swap.peak     pids.peak

If we look for the memory limit:

root@sbh:/sys/fs/cgroup/demo-cgroup# cat memory.max
max

This means that there’s no memory limit; the processes in this cgroup can use as much memory as the host has free.

But there are no processes in our new cgroup yet. If we look at the process count in pids.current, it’s 0:

root@sbh:/sys/fs/cgroup/demo-cgroup# cat pids.current 
0

Also, if we look at the process IDs in cgroup.procs, there’s nothing:

root@sbh:/sys/fs/cgroup/demo-cgroup# cat cgroup.procs 
root@sbh:/sys/fs/cgroup/demo-cgroup#

Let’s change this. We will create a new sh shell process and add it to our new demo-cgroup control group. In a separate terminal, let’s create the process:

root@sbh:~# sh
# echo $$
8149
# 

Our new shell process has process ID 8149. Let’s add the process to the demo-cgroup cgroup.

root@sbh:/sys/fs/cgroup/demo-cgroup# echo 8149 >> cgroup.procs 

Looking at pids.current now, it shows 1, as we added one process to the cgroup:

root@sbh:/sys/fs/cgroup/demo-cgroup# cat pids.current 
1
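We can also verify the membership from the process’s side. On cgroup v2, /proc/&lt;pid&gt;/cgroup contains a single line of the form 0::&lt;cgroup path&gt; (output illustrative):

root@sbh:~# cat /proc/8149/cgroup
0::/demo-cgroup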

In the shell process, we have no memory limit yet, so we can run commands normally:

# ls
snap
# pwd
/root
# touch file1
# ls
file1  snap
# 

Now, let’s set a memory limit in the cgroup by overwriting the memory.max file:

root@sbh:/sys/fs/cgroup/demo-cgroup# cat memory.max 
max
root@sbh:/sys/fs/cgroup/demo-cgroup# echo 100000 > memory.max 
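A side note: the value is interpreted in bytes, and when read back the kernel may show it rounded down to a multiple of the page size, so on a machine with 4 KiB pages our 100000 would show up as 98304 (illustrative):

root@sbh:/sys/fs/cgroup/demo-cgroup# cat memory.max 
98304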

The new memory limit we set is very small. It’s enough for the shell process to keep running, but not enough to do anything meaningful, not even to list the files and directories:

# 
# ls
Killed
root@sbh:~#  

In trying to execute ls, the process tried to exceed its memory limit set in the demo-cgroup control group, and was killed.
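The cgroup itself records this event. The memory.events file keeps counters for limit-related events, including OOM kills (the counts below are illustrative):

root@sbh:/sys/fs/cgroup/demo-cgroup# cat memory.events
low 0
high 0
max 7
oom 1
oom_kill 1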

This is how resource limits for containers are enforced in Linux: container runtimes simply write the limits into cgroup files like the ones we just saw.