Namespaces: The ‘contain’ in containers
You can also read this over at towardsk8s.com.
If you’ve ever run a shell inside a container, you know that a container looks like a completely different machine. The shell process, and by extension you, have a completely different view of the system. But at the end of the day, the container is just another Linux process. So how does the container get so much isolation? It’s achieved through a Linux construct called a namespace. Let’s dip our feet into namespaces.
To avoid any confusion, it’s worth mentioning that these namespaces have nothing to do with Kubernetes Namespaces. They are two completely different concepts.
First let’s run a container from the Ubuntu image and open a bash process in it:
ubuntu@sbh:~$ docker run --rm -it ubuntu bash
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
49b384cc7b4a: Pull complete
Digest: sha256:3f85b7caad41a95462cf5b787d8a04604c8262cdcdf9a472b8c52ef83375fe15
Status: Downloaded newer image for ubuntu:latest
root@127f8e451bf7:/#
This should be a familiar sight. It’s easy to notice that a lot of things are different inside the container than on the host. Let’s consider the hostname, for example.
# On the host
ubuntu@sbh:~$ hostname
sbh
# Inside the container
root@127f8e451bf7:/# hostname
127f8e451bf7
‘Duh!’, you may be thinking.
But this separation is an important feature of containers, regardless of how simple it looks. We can see similar separation for PIDs:
On the host:
ubuntu@sbh:~$ sudo ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 05:46 ? 00:00:13 /sbin/init
root 2 0 0 05:46 ? 00:00:00 [kthreadd]
root 3 2 0 05:46 ? 00:00:00 [rcu_gp]
root 4 2 0 05:46 ? 00:00:00 [rcu_par_gp]
root 5 2 0 05:46 ? 00:00:00 [slub_flushwq]
root 6 2 0 05:46 ? 00:00:00 [netns]
root 8 2 0 05:46 ? 00:00:00 [kworker/0:0H-kblockd]
root 11 2 0 05:46 ? 00:00:00 [mm_percpu_wq]
root 12 2 0 05:46 ? 00:00:00 [rcu_tasks_rude_kthread]
root 13 2 0 05:46 ? 00:00:00 [rcu_tasks_trace_kthread]
root 14 2 0 05:46 ? 00:00:00 [ksoftirqd/0]
root 15 2 0 05:46 ? 00:00:01 [rcu_sched]
.......
root 6593 6573 0 08:09 pts/0 00:00:00 bash
There are a lot of processes, including a bash process (PID 6593), which is the container process.
However, when we list processes from inside the container, we see little besides the bash process:
root@127f8e451bf7:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 08:09 pts/0 00:00:00 bash
root 9 1 0 08:14 pts/0 00:00:00 ps -ef
Just like the hostname, the processes (i.e. the PIDs) look completely different from the host and from inside the container. And it’s not just these: users and groups, network interfaces, mount points, and so on are also different inside the container. All of this is achieved through namespaces.
Namespaces control what resources a process can ‘see’. By putting a process in a namespace, we can restrict its view of the system. There are several types of namespaces in Linux: UTS namespaces, PID namespaces, network namespaces, and so on. See the namespaces man page for all the types and more information.
Each ‘type’ of namespace isolates one type of resource. For example, a network namespace gives the processes inside it their own isolated set of network interfaces. If a process belongs to a particular PID namespace, it can only see the PIDs (processes) in that namespace.
A UTS namespace isolates the hostname and domain name. So we saw different hostnames on the host and inside the container above because the two processes that ran the command belonged to two different UTS namespaces.
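One way to check which namespaces a process belongs to, without any extra tools, is to look under /proc: each process exposes its namespaces as symbolic links in /proc/<pid>/ns. The sketch below is only an illustration and not part of the original walkthrough; list_ns is a name made up for the example.

```shell
#!/usr/bin/env bash
# list_ns is a hypothetical helper name, not a standard command.
# Print the namespaces a process belongs to, one "type:[inode]" per line.
list_ns() {
  local pid="${1:-$$}"   # default to the current shell's PID
  local link
  for link in /proc/"$pid"/ns/*; do
    # Each entry is a symlink whose target looks like uts:[4026531838]
    readlink "$link"
  done
}

list_ns
```

Two processes are in the same namespace of a given type exactly when these links point to the same inode number. Inspecting another user's process this way generally requires root.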
We can see the namespaces on the system by running the lsns command. Before starting the container, the output of lsns looked like this:
ubuntu@sbh:~$ sudo lsns
NS TYPE NPROCS PID USER COMMAND
4026531834 time 124 1 root /sbin/init
4026531835 cgroup 124 1 root /sbin/init
4026531836 pid 124 1 root /sbin/init
4026531837 user 124 1 root /sbin/init
4026531838 uts 120 1 root /sbin/init
4026531839 ipc 124 1 root /sbin/init
4026531840 net 124 1 root /sbin/init
4026531841 mnt 113 1 root /sbin/init
4026531862 mnt 1 25 root kdevtmpfs
4026532171 mnt 1 153 root /lib/systemd/systemd-udevd
4026532172 uts 1 153 root /lib/systemd/systemd-udevd
4026532191 mnt 1 401 systemd-network /lib/systemd/systemd-networkd
4026532202 mnt 1 403 systemd-resolve /lib/systemd/systemd-resolved
4026532203 mnt 2 485 _chrony /usr/sbin/chronyd -F 1
4026532211 mnt 2 4497 root dockerd --group docker --exec-root=/run/snap.docker --data-root=/var/snap/docker/common/var-lib-docker --pidfile=/run/snap.docker/docker.pid --config-file=/var/snap
4026532260 mnt 1 450 root /usr/sbin/irqbalance --foreground
4026532261 uts 2 485 _chrony /usr/sbin/chronyd -F 1
4026532262 mnt 1 467 root /lib/systemd/systemd-logind
4026532263 uts 1 467 root /lib/systemd/systemd-logind
4026532321 mnt 1 539 root /usr/sbin/ModemManager
After the container was started, a few more namespaces of different types were added:
ubuntu@sbh:~$ sudo lsns
NS TYPE NPROCS PID USER COMMAND
...... < pre existing namespaces > .......
4026532215 mnt 1 6593 root bash
4026532216 uts 1 6593 root bash
4026532217 ipc 1 6593 root bash
4026532218 pid 1 6593 root bash
4026532219 net 1 6593 root bash
...... < more pre existing namespaces > .......
We can see that Docker created namespaces of type mnt, uts, ipc, pid and net for the container process (PID 6593). This is why the PIDs, hostname, and so on were isolated. The container process has no time namespace of its own, so it ‘shares’ the host’s time namespace. That means the system time is the same inside and outside the container.
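The idea of sharing or not sharing a namespace can be checked directly by comparing the links in /proc/<pid>/ns for two processes. The following is a minimal sketch under that assumption; same_ns is a hypothetical helper, not an existing command.

```shell
#!/usr/bin/env bash
# same_ns is a hypothetical helper: succeed if two PIDs share
# a namespace of the given type (uts, pid, net, time, ...).
same_ns() {
  local type="$1" pid_a="$2" pid_b="$3"
  local a b
  # Return 2 if a link is unreadable (no such PID, or no permission).
  a="$(readlink "/proc/$pid_a/ns/$type" 2>/dev/null)" || return 2
  b="$(readlink "/proc/$pid_b/ns/$type" 2>/dev/null)" || return 2
  [ "$a" = "$b" ]
}

# A shell and its parent normally share every namespace.
if same_ns uts "$$" "$PPID"; then
  echo "PID $$ and PID $PPID share a UTS namespace"
fi
```

On the host from this article, run from a root shell, same_ns time 1 6593 would be expected to succeed and same_ns uts 1 6593 to fail, matching the lsns output above.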
To see the creation and removal of namespaces more clearly, you can run the lsns command with the -t flag to filter by namespace type.
When the container is running, we see two PID namespaces, one for the host processes and one for the container process.
ubuntu@sbh:~$ sudo lsns -t pid
NS TYPE NPROCS PID USER COMMAND
4026531836 pid 123 1 root /sbin/init
4026532218 pid 1 6593 root bash
When the container is stopped and removed, we see only one PID namespace, as the container namespace is removed:
ubuntu@sbh:~$ sudo lsns -t pid
NS TYPE NPROCS PID USER COMMAND
4026531836 pid 126 1 root /sbin/init
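If lsns is not available, a rough approximation of the same count can be had by collecting the distinct namespace links under /proc. This is a sketch, not a replacement for lsns; count_ns is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_ns is a hypothetical helper: count the distinct namespaces of one
# type among the processes the current user is allowed to inspect.
count_ns() {
  local type="$1" link
  for link in /proc/[0-9]*/ns/"$type"; do
    # Skip processes that have exited or that we may not inspect.
    readlink "$link" 2>/dev/null
  done | sort -u | wc -l
}

echo "pid namespaces visible: $(count_ns pid)"
```

Run as root on the host above, this should report 2 while the container is running and 1 after it is removed. Without root, only processes the caller can inspect are counted, so the number may be lower.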
We can also create our own namespaces and run processes inside them. To do so, we can use the unshare command, so named because we no longer want to ‘share’ the namespace with the rest of the system.
Let’s create a UTS namespace for a new process and try to reproduce the behavior we saw in the Docker container. I will open two terminal sessions and call them ‘terminal 1’ and ‘terminal 2’. We will run a process in our new namespace in terminal 2.
In terminal 2, let’s execute the unshare command to run a bash process with its UTS namespace unshared from the parent:
ubuntu@sbh:~$ sudo unshare --uts bash
root@sbh:/home/ubuntu# hostname newname
root@sbh:/home/ubuntu# hostname
newname
We see that the hostname has changed for this process. In terminal 1, where the shell is in the host’s original UTS namespace, the hostname is the same as before:
ubuntu@sbh:~$ hostname
sbh
In terminal 2, we can exit the unshared bash process to see that the hostname is again sbh. This is akin to getting out of the container shell and back into the host shell.
root@sbh:/home/ubuntu# exit
exit
ubuntu@sbh:~$ hostname
sbh
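The two-terminal session above can also be condensed into one script. This is a sketch under a few assumptions: unshare from util-linux with its --uts flag, a hostname command that accepts a new name as its argument, and root privileges. The name uts_demo is invented for the example.

```shell
#!/usr/bin/env bash
# uts_demo is a hypothetical helper: rename the host inside a fresh UTS
# namespace and show that the name outside is untouched.
# Assumes root (CAP_SYS_ADMIN) and util-linux's unshare.
uts_demo() {
  local new_name="${1:-newname}" before inside after
  before="$(uname -n)"
  # unshare --uts runs the given command in a new UTS namespace.
  inside="$(unshare --uts sh -c 'hostname "$1" && uname -n' sh "$new_name" 2>/dev/null)" || {
    echo "could not unshare the UTS namespace (are you root?)" >&2
    return 1
  }
  after="$(uname -n)"
  printf 'before=%s inside=%s after=%s\n' "$before" "$inside" "$after"
}

uts_demo newname || true   # non-fatal when run without privileges
```

When it succeeds, the before and after values should match, while inside shows the new name. The namespace disappears as soon as the sh process inside it exits, just as it did when we exited bash in terminal 2.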
This was a very brief introduction to namespaces. We can build on top of this to create namespaces of other types and make our process more and more like the Docker container that we ran in the beginning. For more details, you may refer to Chapter 4, Container Isolation, of the book Container Security. In fact, I highly recommend going through the entire book.