You can also read this over at towardsk8s.com.

If you’ve ever run a shell inside a container, you know that a container looks like a completely different machine. The shell process, and by extension you, get a completely different view of the system. But at the end of the day, the container is just another Linux process. So how does it appear so isolated? The isolation comes from a Linux kernel feature called namespaces. Let’s dip our feet into them.

To avoid any confusion, it’s worth mentioning that these namespaces have nothing to do with Kubernetes Namespaces. They are two completely different concepts.

First let’s run a container from the Ubuntu image and open a bash process in it:

ubuntu@sbh:~$ docker run --rm -it ubuntu bash
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
49b384cc7b4a: Pull complete 
Digest: sha256:3f85b7caad41a95462cf5b787d8a04604c8262cdcdf9a472b8c52ef83375fe15
Status: Downloaded newer image for ubuntu:latest
root@127f8e451bf7:/#

This should be a familiar sight. It is easy to notice that many things differ between the container and the host. Take the hostname, for example.

# On the host
ubuntu@sbh:~$ hostname
sbh

# Inside the container
root@127f8e451bf7:/# hostname
127f8e451bf7

‘Duh!’, you may be thinking.

But this separation is an important feature of containers, regardless of how simple it looks. We can see similar separation for PIDs:

On the host:

ubuntu@sbh:~$ sudo ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 05:46 ?        00:00:13 /sbin/init
root           2       0  0 05:46 ?        00:00:00 [kthreadd]
root           3       2  0 05:46 ?        00:00:00 [rcu_gp]
root           4       2  0 05:46 ?        00:00:00 [rcu_par_gp]
root           5       2  0 05:46 ?        00:00:00 [slub_flushwq]
root           6       2  0 05:46 ?        00:00:00 [netns]
root           8       2  0 05:46 ?        00:00:00 [kworker/0:0H-kblockd]
root          11       2  0 05:46 ?        00:00:00 [mm_percpu_wq]
root          12       2  0 05:46 ?        00:00:00 [rcu_tasks_rude_kthread]
root          13       2  0 05:46 ?        00:00:00 [rcu_tasks_trace_kthread]
root          14       2  0 05:46 ?        00:00:00 [ksoftirqd/0]
root          15       2  0 05:46 ?        00:00:01 [rcu_sched]
.......
root        6593    6573  0 08:09 pts/0    00:00:00 bash

There are a lot of processes, including a bash process (PID 6593), which is the container process.

However, when we list processes from inside the container, we see little besides the bash process:

root@127f8e451bf7:/# ps -ef 
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 08:09 pts/0    00:00:00 bash
root           9       1  0 08:14 pts/0    00:00:00 ps -ef

Just like the hostname, the processes (i.e. the PIDs) look completely different from the host and from the container. Users and groups, network interfaces, mount points, and more also differ inside the container. All of this is achieved through namespaces.

Namespaces control what resources a process can ‘see’. By putting a process in a namespace, we can restrict its view of the system. There are several types of namespaces in Linux: UTS namespaces, PID namespaces, network namespaces, and so on. See the namespaces man page for the full list and more information.

Each type of namespace isolates one kind of resource. For example, processes in a network namespace see only that namespace’s network interfaces. Likewise, a process that belongs to a particular PID namespace can see only the PIDs (processes) in that namespace.

The UTS namespace isolates the hostname and domain name. So we saw different hostnames on the host and inside the container above because the two processes that ran the command belonged to two different UTS namespaces.
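One way to check which namespaces a given process belongs to is to read the symbolic links under /proc/<pid>/ns, which Linux exposes for every process. The following is a minimal Python sketch, not part of the original walkthrough. The helper names parse_ns_link and namespaces_of are made up for illustration, and the ID in the comment is only an example value.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "uts:[4026531838]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into ('uts', 4026531838)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Return {entry name: namespace ID} for a process.

    Returns an empty dict if /proc/<pid>/ns cannot be read, for example
    on a non-Linux system or for a process owned by another user.
    """
    ns_dir = f"/proc/{pid}/ns"
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return {}
    result = {}
    for entry in entries:
        try:
            _, ns_id = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or do not recognise
        result[entry] = ns_id
    return result


if __name__ == "__main__":
    for name, ns_id in sorted(namespaces_of().items()):
        print(f"{name:20} {ns_id}")
```

Two processes are in the same namespace of a given type exactly when these IDs match, so running the script on the host and inside a container should print different IDs for the isolated types.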

We can see the namespaces on the system by running the lsns command. Before starting the container, the output of lsns looked like this:

ubuntu@sbh:~$ sudo lsns
        NS TYPE   NPROCS   PID USER            COMMAND
4026531834 time      124     1 root            /sbin/init
4026531835 cgroup    124     1 root            /sbin/init
4026531836 pid       124     1 root            /sbin/init
4026531837 user      124     1 root            /sbin/init
4026531838 uts       120     1 root            /sbin/init
4026531839 ipc       124     1 root            /sbin/init
4026531840 net       124     1 root            /sbin/init
4026531841 mnt       113     1 root            /sbin/init
4026531862 mnt         1    25 root            kdevtmpfs
4026532171 mnt         1   153 root            /lib/systemd/systemd-udevd
4026532172 uts         1   153 root            /lib/systemd/systemd-udevd
4026532191 mnt         1   401 systemd-network /lib/systemd/systemd-networkd
4026532202 mnt         1   403 systemd-resolve /lib/systemd/systemd-resolved
4026532203 mnt         2   485 _chrony         /usr/sbin/chronyd -F 1
4026532211 mnt         2  4497 root            dockerd --group docker --exec-root=/run/snap.docker --data-root=/var/snap/docker/common/var-lib-docker --pidfile=/run/snap.docker/docker.pid --config-file=/var/snap
4026532260 mnt         1   450 root            /usr/sbin/irqbalance --foreground
4026532261 uts         2   485 _chrony         /usr/sbin/chronyd -F 1
4026532262 mnt         1   467 root            /lib/systemd/systemd-logind
4026532263 uts         1   467 root            /lib/systemd/systemd-logind
4026532321 mnt         1   539 root            /usr/sbin/ModemManager

After the container was started, a few more namespaces of different types were added:

ubuntu@sbh:~$ sudo lsns
        NS TYPE   NPROCS   PID USER            COMMAND
...... < pre existing namespaces > .......
4026532215 mnt         1  6593 root            bash
4026532216 uts         1  6593 root            bash
4026532217 ipc         1  6593 root            bash
4026532218 pid         1  6593 root            bash
4026532219 net         1  6593 root            bash
...... < more pre existing namespaces > .......

We can see that Docker created namespaces of type mnt, uts, ipc, pid and net for the container process (PID 6593). This is why the PIDs, hostname, and so on were isolated. No separate time namespace was created for the container process, so it shares the host’s time namespace. That means the system time is the same inside and outside the container.
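To make the shared versus private split concrete, here is a small Python sketch that compares two sets of namespace IDs. The IDs are copied from the lsns output above. It assumes that any type not listed for PID 6593 (time, cgroup and user) is inherited from the host, and the function name split_namespaces is made up for this example.

```python
def split_namespaces(host, other):
    """Return (shared, private) namespace types of `other` relative to `host`."""
    shared = sorted(t for t in other if host.get(t) == other[t])
    private = sorted(t for t in other if host.get(t) != other[t])
    return shared, private


# Namespace IDs of /sbin/init (PID 1), taken from the first lsns listing.
HOST = {
    "time": 4026531834, "cgroup": 4026531835, "pid": 4026531836,
    "user": 4026531837, "uts": 4026531838, "ipc": 4026531839,
    "net": 4026531840, "mnt": 4026531841,
}

# The container's bash (PID 6593): the five namespaces lsns listed for it,
# with the remaining types assumed to be inherited from the host.
CONTAINER = dict(
    HOST,
    mnt=4026532215, uts=4026532216, ipc=4026532217,
    pid=4026532218, net=4026532219,
)

shared, private = split_namespaces(HOST, CONTAINER)
print("shared: ", shared)   # shared:  ['cgroup', 'time', 'user']
print("private:", private)  # private: ['ipc', 'mnt', 'net', 'pid', 'uts']
```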

To see the creation and removal of namespaces more clearly, you can run the lsns command with the -t flag to filter by namespace type.

When the container is running, we see two PID namespaces, one for the host processes and one for the container process.

ubuntu@sbh:~$ sudo lsns -t pid
        NS TYPE NPROCS   PID USER COMMAND
4026531836 pid     123     1 root /sbin/init
4026532218 pid       1  6593 root bash

When the container is stopped and removed, we see only one PID namespace, as the container namespace is removed:

ubuntu@sbh:~$ sudo lsns -t pid
        NS TYPE NPROCS PID USER COMMAND
4026531836 pid     126   1 root /sbin/init
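The same comparison can be scripted. The sketch below parses two plain-text lsns snapshots and reports the namespaces that appear only in the second one. It assumes the default column layout shown above, and the function names are made up for this example.

```python
def parse_lsns(text):
    """Parse plain `lsns` output into a set of (namespace ID, type) pairs."""
    found = set()
    for line in text.splitlines():
        fields = line.split(None, 5)  # NS TYPE NPROCS PID USER COMMAND
        # Skip the header row and anything that does not start with a numeric ID.
        if len(fields) >= 2 and fields[0].isdigit():
            found.add((int(fields[0]), fields[1]))
    return found


def only_in_second(first, second):
    """Namespaces present in the second snapshot but not in the first."""
    return sorted(parse_lsns(second) - parse_lsns(first))


# Snapshots copied from the two `lsns -t pid` runs above.
WITHOUT_CONTAINER = """\
        NS TYPE NPROCS PID USER COMMAND
4026531836 pid     126   1 root /sbin/init
"""
WITH_CONTAINER = """\
        NS TYPE NPROCS   PID USER COMMAND
4026531836 pid     123     1 root /sbin/init
4026532218 pid       1  6593 root bash
"""

print(only_in_second(WITHOUT_CONTAINER, WITH_CONTAINER))  # [(4026532218, 'pid')]
```

Comparing on the namespace ID rather than the whole row matters here, because the NPROCS column of the host namespace changes between runs.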

We can also create our own namespaces and run processes inside them. To do so, we can use the unshare command, so named because the new process no longer ‘shares’ a namespace with the rest of the system.

Let’s create a UTS namespace for a new process and try to reproduce the behavior we saw in the Docker container. I will open two terminal sessions and call them ‘terminal 1’ and ‘terminal 2’. The process with the new namespace will run in terminal 2.

In terminal 2, let’s run the unshare command to start a bash process whose UTS namespace is unshared from its parent, and then change the hostname from inside it:

ubuntu@sbh:~$ sudo unshare --uts bash
root@sbh:/home/ubuntu# hostname newname
root@sbh:/home/ubuntu# hostname
newname

We see that the hostname has changed for this process. In terminal 1, where the shell is in the host’s original UTS namespace, the hostname is the same as before:

ubuntu@sbh:~$ hostname
sbh

In terminal 2, we can exit the unshared bash process to see that the hostname is again sbh. This is akin to getting out of the container shell and into the host shell.

root@sbh:/home/ubuntu# exit
exit
ubuntu@sbh:~$ hostname
sbh
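Under the hood, the unshare command relies on the unshare(2) system call. As a rough illustration, the Python sketch below does the same experiment programmatically: it forks a child, moves the child into a new UTS namespace, renames it, and reports the hostname the child sees. It is only a sketch under a few assumptions: it calls libc through ctypes, it needs root (more precisely CAP_SYS_ADMIN) to succeed, and the function name hostname_in_new_uts is made up for this example.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000  # flag value from <sched.h>


def hostname_in_new_uts(new_name):
    """Fork a child, unshare its UTS namespace, rename it, and report what it sees.

    Returns the hostname seen by the child, or None if the attempt failed
    (for example without root privileges or on a non-Linux system).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(read_fd)
            libc = ctypes.CDLL(None, use_errno=True)
            # Only touch the hostname if we really are in a new namespace,
            # so the host's name can never be changed by accident.
            if libc.unshare(CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)
                os.write(write_fd, socket.gethostname().encode())
        except Exception:
            pass  # report failure by writing nothing
        finally:
            os._exit(0)
    # parent process
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        seen = pipe.read()
    os.waitpid(pid, 0)
    return seen.decode() or None


if __name__ == "__main__":
    print("child saw: ", hostname_in_new_uts("newname"))
    print("parent has:", socket.gethostname())
```

When run with sudo, the child should report newname while the parent still reports the original hostname, mirroring what the two terminals showed. Without sufficient privileges, the function returns None and nothing is changed.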

This was a very brief introduction to namespaces. We can build on top of this and create namespaces of other types to make our process more and more like the Docker container that we ran in the beginning. For more details, you may refer to Chapter 4, Container Isolation, of the book Container Security by Liz Rice. In fact, I highly recommend going through the entire book.