Control groups: Controlling containers
If you’ve worked with containers before, you have probably also set resource limits for them. For example, to set the memory limit for a container in a Kubernetes pod, you’d specify something like:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: myapp
    image: myimage
    resources:
      limits:
        memory: "128Mi"
Have you ever wondered how the operating system makes sure that this particular container doesn’t actually use more than 128Mi of memory? It’s through a feature of the Linux kernel called control groups, or simply, cgroups. Let’s dig deeper into cgroups.
Note: The discussion here is about cgroup v2. Although the core idea is the same, how cgroups are organized is a bit different in cgroup v1.
Control groups allow processes to be put into groups and limit the resource usage (e.g. CPU, memory, network I/O) of those groups. The information about control groups lives in the /sys/fs/cgroup directory, and cgroups can be listed with the lscgroup command.
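Before digging in, it is worth confirming that the host is actually running cgroup v2. A minimal check, assuming GNU coreutils and the standard mount point:
# prints "cgroup2fs" when the unified cgroup v2 hierarchy is mounted
stat -fc %T /sys/fs/cgroup
# controllers available at the root of the hierarchy
cat /sys/fs/cgroup/cgroup.controllers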
To see cgroups in action, let’s create a Docker container with a memory limit of 6 MB.
docker run -d --memory 6m nginx
Let’s get the ID of the container:
root@sbh:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
79440b327c3c nginx "/docker-entrypoint.…" 6 minutes ago Up 6 minutes 80/tcp naughty_merkle
It’s 79440b....
Using lscgroup, let’s look for the control group of this container:
root@sbh:~# lscgroup | grep docker
cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:/system.slice/docker.service
cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:/system.slice/docker.socket
cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:/system.slice/docker-79440b327c3c776998dfdc34689e276afbbc9989ea9dc87602acd8848e0f2fb1.scope
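For another view of the same tree, systemd-cgls on systemd hosts prints the cgroup hierarchy along with the processes in each group (a quick sketch; the exact output depends on the host):
# show the cgroup tree, including the processes inside each cgroup
systemd-cgls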
We see that the cgroup information for this container is in /system.slice/docker-79440b327c3c776998dfdc34689e276afbbc9989ea9dc87602acd8848e0f2fb1.scope. Let’s look at this hierarchy inside /sys/fs/cgroup, where all the cgroups are located, and specifically at the memory limit. The memory limit, in bytes, is stored in the file memory.max inside the cgroup directory.
root@sbh:/sys/fs/cgroup# cat system.slice/docker-79440b327c3c776998dfdc34689e276afbbc9989ea9dc87602acd8848e0f2fb1.scope/memory.max
6291456
It’s precisely 6 MiB (6 × 1024 × 1024 bytes).
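As a quick cross-check, the arithmetic works out, and Docker reports the same value for the container (HostConfig.Memory is where docker inspect exposes the configured limit, in bytes):
# 6 MiB expressed in bytes
echo $((6 * 1024 * 1024))
# the memory limit Docker recorded for the container, also in bytes
docker inspect -f '{{.HostConfig.Memory}}' 79440b327c3c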
To get a bit more transparency, let’s create our own cgroup. To do this, we simply create a directory inside /sys/fs/cgroup. Let’s create a cgroup called demo-cgroup.
root@sbh:/sys/fs/cgroup# mkdir demo-cgroup
If we check using lscgroup, we can see that the cgroup has been created:
root@sbh:/sys/fs/cgroup/demo-cgroup# lscgroup | grep demo
cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:/demo-cgroup
As soon as we create this directory, the kernel populates the cgroup with default values for all the resource limits. Let’s look inside the directory:
root@sbh:/sys/fs/cgroup# cd demo-cgroup/
root@sbh:/sys/fs/cgroup/demo-cgroup# ls
cgroup.controllers cgroup.subtree_control cpu.uclamp.min hugetlb.1GB.events hugetlb.2MB.max memory.current memory.peak memory.zswap.current rdma.current
cgroup.events cgroup.threads cpu.weight hugetlb.1GB.events.local hugetlb.2MB.numa_stat memory.events memory.pressure memory.zswap.max rdma.max
cgroup.freeze cgroup.type cpu.weight.nice hugetlb.1GB.max hugetlb.2MB.rsvd.current memory.events.local memory.reclaim misc.current
cgroup.kill cpu.idle cpuset.cpus hugetlb.1GB.numa_stat hugetlb.2MB.rsvd.max memory.high memory.stat misc.events
cgroup.max.depth cpu.max cpuset.cpus.effective hugetlb.1GB.rsvd.current io.max memory.low memory.swap.current misc.max
cgroup.max.descendants cpu.max.burst cpuset.cpus.partition hugetlb.1GB.rsvd.max io.pressure memory.max memory.swap.events pids.current
cgroup.pressure cpu.pressure cpuset.mems hugetlb.2MB.current io.prio.class memory.min memory.swap.high pids.events
cgroup.procs cpu.stat cpuset.mems.effective hugetlb.2MB.events io.stat memory.numa_stat memory.swap.max pids.max
cgroup.stat cpu.uclamp.max hugetlb.1GB.current hugetlb.2MB.events.local io.weight memory.oom.group memory.swap.peak pids.peak
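Which interface files show up here depends on the controllers enabled for child cgroups in the parent’s cgroup.subtree_control; the lscgroup output above shows that the cpu, memory, pids, and other controllers are already enabled for our cgroup, which is why the memory.* and pids.* files exist. If a file such as memory.max were missing, the corresponding controller could be enabled like this (a sketch, assuming root access):
# controllers currently enabled for children of the root cgroup
cat /sys/fs/cgroup/cgroup.subtree_control
# enable the memory and pids controllers for child cgroups
echo "+memory +pids" > /sys/fs/cgroup/cgroup.subtree_control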
If we look for the memory limit:
root@sbh:/sys/fs/cgroup/demo-cgroup# cat memory.max
max
This means that there’s no memory limit: the processes in this cgroup can use as much memory as the host has free.
But there are no processes in our new cgroup yet. If we look at the process count, it’s 0:
root@sbh:/sys/fs/cgroup/demo-cgroup# cat pids.current
0
Also, if we look for the process IDs in cgroup.procs, there’s nothing:
root@sbh:/sys/fs/cgroup/demo-cgroup# cat cgroup.procs
root@sbh:/sys/fs/cgroup/demo-cgroup#
Let’s change this. We will create a new sh shell process and add it to our new demo-cgroup control group. In a separate terminal, let’s create the process:
root@sbh:~# sh
# echo $$
8149
#
Our new shell process has process ID 8149. Let’s add the process to the demo-cgroup cgroup.
root@sbh:/sys/fs/cgroup/demo-cgroup# echo 8149 >> cgroup.procs
Looking at pids.current now, it shows 1, as we added one process to the cgroup:
root@sbh:/sys/fs/cgroup/demo-cgroup# cat pids.current
1
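We can confirm the membership from the process’s side as well: on cgroup v2, /proc/<pid>/cgroup contains a single line with the process’s cgroup path relative to /sys/fs/cgroup.
# should print a single line of the form 0::/<cgroup-path>
cat /proc/8149/cgroup
# expected output: 0::/demo-cgroup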
The shell process has no memory limit yet, so it can run commands normally:
# ls
snap
# pwd
/root
# touch file1
# ls
file1 snap
#
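While the shell is running, the cgroup’s accounting files show how much memory it is actually using, which is useful before deciding on a limit (values are in bytes):
# memory currently charged to the cgroup, and its high-water mark
cat /sys/fs/cgroup/demo-cgroup/memory.current
cat /sys/fs/cgroup/demo-cgroup/memory.peak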
Now, let’s set a memory limit in the cgroup by overwriting the memory.max file:
root@sbh:/sys/fs/cgroup/demo-cgroup# cat memory.max
max
root@sbh:/sys/fs/cgroup/demo-cgroup# echo 100000 > memory.max
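One detail worth knowing: the kernel tracks memory limits in pages, so the written value is rounded down to a page boundary, and reading the file back shows the effective limit (with 4 KiB pages, 100000 becomes 98304):
# the effective limit, rounded down to a whole number of pages
cat /sys/fs/cgroup/demo-cgroup/memory.max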
The new memory limit we set, roughly 100 kB, is very small. It’s enough for the shell process to keep running, but not enough to do anything meaningful, not even to list files and directories:
#
# ls
Killed
root@sbh:~#
While trying to execute ls, the process exceeded the memory limit set in the demo-cgroup control group and the kernel’s OOM killer stepped in. In this case it killed the sh process itself, which is why we land back at the outer bash prompt.
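The cgroup keeps a record of what happened: memory.events counts how many times the limit was hit and how many OOM kills occurred, and the kernel log usually has a matching entry (exact counts and messages will vary):
# max = times the limit was hit, oom_kill = processes killed by the OOM killer
cat /sys/fs/cgroup/demo-cgroup/memory.events
# the kernel log typically records the cgroup OOM kill as well
dmesg | grep -i "out of memory" | tail -n 3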
This is how resource limits for containers are handled in Linux: the container runtime creates a cgroup for each container and writes the configured limits into files like memory.max.
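To clean up the experiment, the demo cgroup can simply be removed once it is empty; since the shell was killed, it already is. If it weren’t, the remaining processes would first have to be moved out, for example by echoing their PIDs into /sys/fs/cgroup/cgroup.procs.
# the cgroup must be empty before removal
cat /sys/fs/cgroup/demo-cgroup/pids.current
rmdir /sys/fs/cgroup/demo-cgroup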