Linux Namespaces (Container Technology)

Ryan Zheng · Published in Geek Culture · May 9, 2021 · 21 min read
Containers on Linux are built on the namespaces provided by the Linux kernel. On Linux, containers run as normal processes that share the kernel with all other processes. On Windows or macOS, a virtual machine is started first, and all containers run inside that virtual machine.

Resource Namespace

In computer science, everything can be represented by an object: network devices, processes, threads, PID numbers, routing tables, file systems, etc.

If we define a namespace object and make each network device object a member of that namespace object, we can restrict the scope of the network devices to that namespace. The same goes for other kinds of objects.

The above relationship can be represented by

struct namespace {
	struct businessobject *scopedobject;
};
struct businessobject {
	/* other members of the business object */
};

Linux applies this concept. It creates a different namespace type for each kind of business object. For example, it creates pid_namespace to scope pid_t objects, and the net namespace (struct net) to scope network devices, ports, routing tables, etc. The namespace types are documented at https://man7.org/linux/man-pages/man7/namespaces.7.html. Each process holds a reference to its active namespaces.

struct task_struct {
	struct nsproxy *nsproxy;
};
struct nsproxy {
	struct uts_namespace *uts_ns;
	struct ipc_namespace *ipc_ns;
	struct mnt_namespace *mnt_ns;
	struct pid_namespace *pid_ns_for_children;
	struct net *net_ns;
	struct cgroup_namespace *cgroup_ns;
};
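Namespace membership is visible from userspace as symlinks under /proc/&lt;pid&gt;/ns, whose targets look like pid:[4026531836]. As a small illustration (the helper names below are ours, not a real API), that link text can be parsed to compare the namespaces of two processes:

```python
# Hypothetical helpers (not from the article): on Linux, /proc/<pid>/ns/pid
# is a symlink to text like "pid:[4026531836]". Two processes are in the
# same namespace exactly when the inode numbers in brackets match.

def parse_ns_link(link_text: str) -> tuple[str, int]:
    """Split 'pid:[4026531836]' into ('pid', 4026531836)."""
    ns_type, _, rest = link_text.partition(":")
    return ns_type, int(rest.strip("[]"))

def same_namespace(link_a: str, link_b: str) -> bool:
    return parse_ns_link(link_a) == parse_ns_link(link_b)

# Example: compare two processes' PID namespace links.
print(parse_ns_link("pid:[4026532772]"))                       # ('pid', 4026532772)
print(same_namespace("pid:[4026531836]", "pid:[4026532772]"))  # False
```

In practice the link text would come from os.readlink on the /proc paths; the comparison logic is the same.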

In this article, we will focus on four namespaces: PID, Mount, User, Network namespaces.

PID Namespace

Each process on Linux has a unique PID number. The processes are in a tree structure. When one process creates another, it becomes the parent of the other process. The first process on Linux is the init process which has PID=1.

We can use pstree -n to show the processes in a tree structure

By default, Linux creates one init pid_namespace; all processes reside inside this init pid_namespace.

struct pid_namespace {
	struct pid_namespace *parent;
	struct user_namespace *user_ns;
};
extern struct pid_namespace init_pid_ns;

Processes can create a new pid_namespace and put their child processes in it. The new pid_namespace becomes a child of the parent pid_namespace. PID numbering in a new pid_namespace starts at 1.

three-layer PID namespace

A process sees its own PID from within its own pid_namespace. A process outside that pid_namespace, however, uses its own pid_namespace as the frame of reference and sees a different PID number.

In the picture, we have a three-layer PID namespace hierarchy. The processes in the child-namespace see themselves having PID 1,2,3. However, processes from init-namespace see the processes in the child-namespace having PID 4,5,6.

Processes in the grandchild-namespace see themselves having PID 1,2,3. However, the processes from child-namespace see them having PID 4,5,6. The processes from init-namespace see them having PID 7,8,9.
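The layered numbering above can be sketched with a toy model (purely illustrative; the classes below are ours, not kernel structures): each process is allocated one PID per ancestor namespace, and an observer sees the PID recorded for its own namespace.

```python
# Toy model of layered PID numbering (illustrative only, not kernel code).
# A process stores one PID per namespace level, from its own namespace up to
# the root, loosely mirroring struct pid's per-level upid array in the kernel.

class Namespace:
    def __init__(self, parent=None):
        self.parent = parent
        self.next_pid = 1          # PID numbering restarts at 1 per namespace
        self.level = 0 if parent is None else parent.level + 1

class Process:
    def __init__(self, ns):
        # Allocate a PID in ns and in every ancestor namespace.
        self.pids = {}
        cur = ns
        while cur is not None:
            self.pids[cur] = cur.next_pid
            cur.next_pid += 1
            cur = cur.parent

    def pid_as_seen_from(self, observer_ns):
        # Visible only from the process's own namespace or an ancestor.
        return self.pids.get(observer_ns)

init_ns = Namespace()
child_ns = Namespace(init_ns)

# Three processes already exist in init_ns (e.g. init itself plus two others).
for _ in range(3):
    Process(init_ns)

p = Process(child_ns)
print(p.pid_as_seen_from(child_ns))  # 1
print(p.pid_as_seen_from(init_ns))   # 4
```

This reproduces the picture: the first process created in the child namespace is PID 1 there, but PID 4 as seen from the init namespace.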

A process has one process ID in each of the layers of the PID
namespace hierarchy in which it is visible, walking back through
each direct ancestor namespace to the root PID namespace.
https://man7.org/linux/man-pages/man7/pid_namespaces.7.html

Experiment

We can use the unshare command to create a new PID namespace. The -p option requests a new PID namespace; the -f option makes unshare fork a child (our shell) that is placed in the new PID namespace. We will explain the -m option later.

root@ryan-ai:~# unshare -fpm
root@ryan-ai:~# echo $$   # print the PID of the shell process
1

As we can see above, the shell process has PID=1 in the new namespace.

root@ryan-ai:~# ps
PID TTY TIME CMD
14818 pts/23 00:00:00 su
14819 pts/23 00:00:00 bash
21236 pts/23 00:00:00 unshare
21237 pts/23 00:00:00 bash
22292 pts/23 00:00:00 ps

However, the output of ps shows that our shell (the last bash) has PID 21237. The reason is that the /proc filesystem shows process information from the PID namespace of the process that mounted it. We have to remount /proc:

root@ryan-ai:~# mount -t proc proc /proc
root@ryan-ai:~# ps
PID TTY TIME CMD
1 pts/23 00:00:00 bash
31 pts/23 00:00:00 ps

Now ps outputs pid=1 for our bash.

We can also use lsns command to show the PID namespace for our bash.

sudo ./lsns -p 21237
NS TYPE NPROCS PID USER COMMAND
4026531835 cgroup 381 1 root /sbin/init splash
4026531837 user 309 1 root /sbin/init splash
4026531838 uts 380 1 root /sbin/init splash
4026531839 ipc 380 1 root /sbin/init splash
4026531840 mnt 372 1 root /sbin/init splash
4026531993 net 306 1 root /sbin/init splash
4026532772 pid 1 21237 root -bash

NPROCS is the number of processes in the namespace. The last row shows that our bash is the only process in the new PID namespace.

Now we have successfully created a process with PID 1 in the new namespace. This mimics the init process of a freshly booted operating system.

Mount Namespace

On Linux, the whole directory tree structure is determined by the mount table. The mount table tells us which filesystem is mounted at which mount point.

struct mount {
	struct hlist_node mnt_hash; /* node in the mount_hashtable */
	struct mount *mnt_parent; /* parent mount */
	struct dentry *mnt_mountpoint; /* mount point */
	struct vfsmount mnt; /* filesystem */
};
struct mountpoint {
	struct dentry *m_dentry;
	struct hlist_head m_list; /* mounts for the same mountpoint */
};

The mounts in Linux are in a tree structure. The parent mount can have multiple children.

mount tree

At the same time, multiple different file systems can be mounted to the same mount point. The last file system mounted is the effective one.

one mount point having multiple mounts

Experiments:

~$ sudo mount -t tmpfs none  $HOME/mnt
~$ sudo mount -t ext4 /dev/sdb2 $HOME/mnt
~$ mount
####The last two entries###
none on /home/ryan/mnt type tmpfs (rw,relatime)
/dev/sdb2 on /home/ryan/mnt type ext4 (rw,relatime,data=ordered)

In the above experiment, we mounted two file systems onto the same folder $HOME/mnt. The last two entries of the mount command show our mounts. The effective one is the mount /dev/sdb2.

In order to quickly find whether a file system is mounted on a given directory, Linux maintains a mount_hashtable. This is an array; each entry is a singly linked list of mounts for the same mount point.

static struct hlist_head *mount_hashtable __read_mostly;

By default, Linux creates one init mount namespace. The root mount is contained inside this namespace.

struct mnt_namespace {
	struct mount *root;
};

When Linux resolves a filename, it starts from the root mount current->nsproxy->mnt_ns->root. For each component in the path, it consults the mount_hashtable to check whether a file system is mounted on that directory. If there is one, the mounted file system is used to continue searching for the next component; otherwise the search continues in the current file system.
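This walk can be sketched as follows. It is a deliberately simplified model: a flat dict stands in for the kernel's mount_hashtable, and dicts stand in for file systems.

```python
# Simplified model of path resolution crossing mounts (illustrative only).
# mount_table maps an absolute mountpoint path to the filesystem mounted
# there; when the walk reaches a mountpoint, it continues in that filesystem.

# Each "filesystem" is a dict: path within that fs -> content.
root_fs = {"/": "rootfs dir", "/home": "dir", "/proc": "empty dir"}
proc_fs = {"/": "procfs dir", "/1": "dir", "/1/status": "pid 1 status"}

mount_table = {"/proc": proc_fs}   # stand-in for the mount_hashtable

def resolve(path):
    fs = root_fs                   # start from the root mount
    fs_root = ""                   # where the current fs is mounted
    walked = ""
    for comp in [c for c in path.split("/") if c]:
        walked += "/" + comp
        if walked in mount_table:      # a mount covers this directory:
            fs = mount_table[walked]   # switch to the mounted filesystem
            fs_root = walked           # paths below are relative to it
    inner = path[len(fs_root):] or "/"
    return fs.get(inner)

print(resolve("/proc/1/status"))   # found inside proc_fs: 'pid 1 status'
print(resolve("/home"))            # stays in root_fs: 'dir'
```

Remounting /proc in the PID namespace experiment above works precisely because the walk switches to whatever filesystem is mounted last on /proc.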

The unshare command can be used to create a new mount namespace. The new mount namespace receives a copy of the mounts from the parent namespace, so child processes in the new mnt-namespace initially see no difference in the directory tree. Their view starts to diverge once they perform new mounts in the new mnt-namespace.

Experiment

~$ unshare -Umr   
~# mount -t tmpfs none /home/ryan/mnt
~/mnt# mkdir test
~/mnt# ls
test
### start a new tab, check whether the test folder exists
~/mnt$ ls

In the first tab, ls shows the test folder; in the second tab, ls shows nothing. This is because unshare -m creates a new mount namespace and starts a shell process in it. The mounts made in the child mnt-namespace are not visible to the parent by default.

There are mount propagation options that control whether mounts propagate from one namespace to others. We will not go into the details.

User Namespace

To understand the user namespace, we first have to understand permission management on Linux. Permission management is a topic of its own; we will only cover the main points.

Permission Management

I will use a metaphor to explain the evolution of permission management in Linux.

root user permission

Model One: one key opens all security boxes

In the above picture, the inner security boxes have no locks of their own. There is only one lock, on the outer security box. Once we get the key for the outer box, all the inner boxes are automatically open to us.

This model is like root user permission. On Linux, the effective uid in the credentials of the current task is normally checked before a privileged call is allowed.

struct cred {
	kuid_t uid; /* real UID of the task */
	kgid_t gid; /* real GID of the task */
	kuid_t suid; /* saved UID of the task */
	kgid_t sgid; /* saved GID of the task */
	kuid_t euid; /* effective UID of the task */
	kgid_t egid; /* effective GID of the task */
	kuid_t fsuid; /* UID for VFS ops */
	kgid_t fsgid; /* GID for VFS ops */
	kernel_cap_t cap_inheritable; /* caps our children can inherit */
	kernel_cap_t cap_permitted; /* caps we're permitted */
	kernel_cap_t cap_effective; /* caps we can actually use */
	kernel_cap_t cap_bset; /* capability bounding set */
	kernel_cap_t cap_ambient; /* Ambient capability set */
}

During execve, the capability sets P' of the new process image are computed from the old sets P and the file capabilities F (see capabilities(7)):

P'(ambient) = (file is privileged) ? 0 : P(ambient)

P'(permitted) = (P(inheritable) & F(inheritable)) |
	(F(permitted) & cap_bset) | P'(ambient)

P'(effective) = F(effective) ? P'(permitted) : P'(ambient)

P'(inheritable) = P(inheritable) [i.e., unchanged]

Once a process elevates its effective uid to the root user, it gains permission to do everything. Even if the process only wants to open port 80, it gets the power to crash the whole system. To elevate the effective uid, setuid programs are used: they change the effective uid of the process to the root user. These programs later became major targets for hackers to exploit.

capabilities

Model Two: each security box has its own lock

To get rid of the one key that opens all security boxes, Linux subdivides the operations that only the root user may perform into categories of permissions called capabilities. A process no longer needs to change its effective user id; it only needs to gain the specific capability required by the specific call.

On this kernel there are 38 capabilities in total (bits 0 to 37 of the bounding set shown below). We can print the capability sets of the current process with grep Cap /proc/$$/status.

Permitted, Bounding, Effective, Inheritable, Ambient Capabilities

In the struct cred structure, there are five categories of capabilities.

struct cred {
	unsigned securebits; /* SUID-less security management */
	kernel_cap_t cap_inheritable; /* caps our children can inherit */
	kernel_cap_t cap_permitted; /* caps we're permitted */
	kernel_cap_t cap_effective; /* caps we can actually use */
	kernel_cap_t cap_bset; /* capability bounding set */
	kernel_cap_t cap_ambient; /* Ambient capability set */
};

Bounding: like the maximum score in a subject, 100.

Permitted: our ability only lets us get 80 out of that total.

Effective: the score we actually get. Linux uses this field for permission checks.

Inheritable: the capabilities that can be inherited from the parent. This depends on the inheritable bits also being set on the binary file, and even when inherited, they are only added to the permitted set.

Ambient: this set exists because the inheritable set is hard to use. Ambient capabilities are handed directly to the child and flow directly into the effective set.
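The P' rules quoted earlier (they come from capabilities(7)) can be tried out as plain bitmask arithmetic. This is a toy model, not kernel code; only the bit positions are taken from linux/capability.h.

```python
# Toy bitmask model of the capabilities(7) execve transition rules
# (illustrative only; variable names mirror the P'/F notation in the text).

CAP_NET_BIND_SERVICE = 1 << 10   # bit position from linux/capability.h

def execve_caps(p_inh, p_amb, bset, f_permitted, f_inheritable,
                f_effective_bit, file_is_privileged):
    amb = 0 if file_is_privileged else p_amb
    permitted = (p_inh & f_inheritable) | (f_permitted & bset) | amb
    effective = permitted if f_effective_bit else amb
    inheritable = p_inh                     # unchanged by execve
    return {"permitted": permitted, "effective": effective,
            "inheritable": inheritable, "ambient": amb}

# A file with CAP_NET_BIND_SERVICE in its permitted set and the effective
# bit on grants exactly that one capability, nothing more:
new = execve_caps(p_inh=0, p_amb=0, bset=(1 << 41) - 1,
                  f_permitted=CAP_NET_BIND_SERVICE, f_inheritable=0,
                  f_effective_bit=True, file_is_privileged=True)
print(new["effective"] == CAP_NET_BIND_SERVICE)   # True
```

This is the whole point of file capabilities: a web server can bind port 80 without ever holding any of the other root powers.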

During login, Linux changes the user id from root to the normal user and starts the user's shell. After the uid switch, all capability sets are recalculated based on the new kernel uid. When a process switches from a privileged user to a non-privileged user, all capabilities are dropped. We can see that in the grep Cap /proc/$BASHPID/status output below: only CapBnd has a value; all the other sets are empty.

$ grep Cap /proc/$BASHPID/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
$ capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,37
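The capsh decoding step can be reproduced in a few lines. The name table below follows the capability numbering in linux/capability.h up to CAP_AUDIT_READ (bit 37); treat it as an assumption to be checked against your kernel headers.

```python
# Decode a capability bitmask like CapBnd from /proc/<pid>/status.
# Bit i of the mask corresponds to capability number i in linux/capability.h.

CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read",
]

def decode_caps(mask_hex: str):
    mask = int(mask_hex, 16)
    return [name for i, name in enumerate(CAP_NAMES) if mask & (1 << i)]

caps = decode_caps("0000003fffffffff")   # the CapBnd value from above
print(len(caps))                          # 38
print(caps[0], caps[-1])                  # cap_chown cap_audit_read
```

The trailing "37" in the capsh output above is capability number 37 (cap_audit_read), which that version of capsh did not yet know by name.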

How does Shell Command work?

When the shell executes a command that is not built in, it does a fork first and then calls execve.

execve first makes a copy of the current task's credentials. It then calculates the new capability sets from the capabilities set on the binary file and the current capabilities. We will not go deep into how this calculation works.

There is an excellent article describing how this is done during the execve call: https://blog.container-solutions.com/linux-capabilities-why-they-exist-and-how-they-work.

Relationship between Resource and User Namespace

Setting capabilities on binary files is tedious work, and sometimes we do not know in advance which capabilities our program needs to pass the kernel's permission checks. Another way of gaining capabilities is through a user namespace: the first privileged process in a new user namespace gains all capabilities (we will explain later what this means). The user namespace is similar to the other namespaces; we can think of it as scoping the resources of the other namespaces.

Linux creates one init user namespace, and all the other init namespaces belong to it. That is to say, the resources of the other init namespaces, such as sockets, mounts, and the system clock, belong to the init user namespace.

struct mnt_namespace {
	struct user_namespace *user_ns;
};
struct pid_namespace {
	struct user_namespace *user_ns;
};

User namespace is also used to scope the capabilities of processes within the user namespace.

If we create a new user namespace and then a new mount namespace, the new mount namespace belongs to the new user namespace, and all mounts in the new mount namespace become resources of the new user namespace.

When a process tries to access a resource that belongs to another user namespace, the kernel checks whether the process's credentials carry enough capability to do so.

The user namespace allows processes to see themselves as privileged users from within the namespace. The capabilities of a process in the user namespace are recalculated based on whether the process sees itself as a privileged user or not.

We will do experiments to see what it means.

Experiment

###Before creating a new user namespace##
ryan@ryan-ai:~$ grep Cap /proc/$BASHPID/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
###Use unshare -Ur to create user namespace, ignore -r for now##
ryan@ryan-ai:~$ unshare -Ur
root@ryan-ai:~# grep Cap /proc/$BASHPID/status
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000

Initially, the shell does not have any capabilities. After unshare -Ur, the shell becomes root (or rather, it thinks it is root) and gains all the capabilities.

unshare internally first calls the unshare system call, then calls execve to run another shell. The unshare system call creates the new user namespace and sets the permitted and effective capability sets to the full set:

static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
{
	cred->securebits = SECUREBITS_DEFAULT;
	cred->cap_inheritable = CAP_EMPTY_SET;
	cred->cap_permitted = CAP_FULL_SET;
	cred->cap_effective = CAP_FULL_SET;
	cred->cap_ambient = CAP_EMPTY_SET;
	cred->cap_bset = CAP_FULL_SET;
}

Now we have two shells running in two different user namespaces.

The execve call then recalculates the capability sets for the new shell process. The logic for computing the capabilities from the binary file is implemented in security/commoncap.c:cap_bprm_creds_from_file in the Linux kernel.

Why does the second shell see itself as root?

kernel uid: every Linux process has a kernel uid (cred->uid). This kernel uid never changes unless the process calls setuid. The kernel uid lives in the init user namespace and is also called the global uid.

Every user namespace has a uid_map member. The uid_map determines which uid a process inside the namespace sees itself as having.

struct user_namespace {
	struct uid_gid_map uid_map;
	struct uid_gid_map gid_map;
	struct user_namespace *parent;
	kuid_t owner;
	kgid_t group;
};
struct uid_gid_map {
	u32 nr_extents;
	struct uid_gid_extent extent[5];
};
struct uid_gid_extent {
	u32 first;
	u32 lower_first;
	u32 count;
};

The uid_map can contain multiple rows. Each row is composed of three parts.

first lower_first count

first: the starting uid as seen by processes inside the namespace

lower_first: internally always stored as a global (kernel) uid

count: the number of uids being mapped

How is the uid_gid_extent getting populated?

Linux exposes the uid_gid_map of the user namespace of the process through the /proc/{pid}/uid_map file.

When writing to the uid_map file of a pid, the format is

ID-inside-ns   ID-outside-ns   length

ID-inside-ns: the uid seen by the process itself inside the user namespace. This value is assigned directly to uid_gid_extent.first.

ID-outside-ns: the uid as expressed in the parent user namespace. It is converted to the global kernel uid, and that global uid is assigned to uid_gid_extent.lower_first. First, a matching extent of the parent user namespace's uid_gid_map is found:

for (idx = 0; idx < extents; idx++) {
	first = map->extent[idx].first;
	last = first + map->extent[idx].count - 1;
	if (ID-outside-ns >= first && ID-outside-ns <= last)
		return &map->extent[idx];
}

Then the global kernel uid is calculated:

output = (ID-outside-ns - extent->first) + extent->lower_first;

The outcome is that the lower_first field of every uid_map, in every user namespace, contains a global kernel uid.

https://code.woboq.org/linux/linux/kernel/user_namespace.c.html:map_id_range_down

length: assigned directly to uid_gid_extent.count

When reading from the uid_map file of a target pid:

first field: comes directly from uid_gid_extent.first of the pid's user namespace.

second field:

case 1: the reader process is in the same user namespace as the pid. The extent is the first matching one of the parent user namespace's uid_gid_map (https://code.woboq.org/linux/linux/kernel/user_namespace.c.html#map_id_up_base). lower_first comes from the pid's namespace.

output = (lower_first - extent->lower_first) + extent->first;

case 2: the reader process is in a different user namespace from the pid. The extent is the matching one of the reader's user namespace. lower_first comes from the pid's namespace.

output = (lower_first - extent->lower_first) + extent->first;

third field: comes directly from uid_gid_extent.count
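Both directions can be sketched as small functions mirroring map_id_down/map_id_up from kernel/user_namespace.c. One simplification to flag: here lower_first stays relative to the parent namespace and we compose translations explicitly, whereas the kernel stores already-converted kernel-level ids in lower_first.

```python
# Sketch of the extent lookup behind uid_map translation (simplified model
# of map_id_down/map_id_up in kernel/user_namespace.c; not kernel code).

# An extent is (first, lower_first, count): ids first..first+count-1 inside
# the namespace map to lower_first..lower_first+count-1 in the parent.

def map_id_down(extents, id_in_ns):
    """Translate an id inside the namespace to the parent namespace."""
    for first, lower_first, count in extents:
        if first <= id_in_ns < first + count:
            return (id_in_ns - first) + lower_first
    return None                       # unmapped (kernel shows the overflow uid)

def map_id_up(extents, parent_id):
    """Translate a parent-namespace id into the namespace."""
    for first, lower_first, count in extents:
        if lower_first <= parent_id < lower_first + count:
            return (parent_id - lower_first) + first
    return None

shell_b = [(0, 1000, 1)]    # unshare -Ur wrote "0 1000 1"
shell_c = [(0, 0, 1)]       # written from shell-B, relative to shell-B's ids

# uid 0 in shell-C -> uid 0 in shell-B -> kernel uid 1000:
print(map_id_down(shell_b, map_id_down(shell_c, 0)))   # 1000
# and back up from the kernel uid to shell-C's view:
print(map_id_up(shell_c, map_id_up(shell_b, 1000)))    # 0
```

Composing the two maps reproduces the experiments below: root in the grandchild namespace is still kernel uid 1000, my login user.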

Experiments:

Previously we saw that the new shell process has root uid. This is because when unshare -Ur is invoked, 0 1000 1 is written into /proc/self/uid_map before the shell is execve'd.

According to the algorithm described above, the uid_gid_extent will contain the same values, 0 1000 1. This means that the global kernel uid 1000 is mapped to uid 0, the root user, inside the user namespace. The child shell has kernel uid 1000, which is my login user id. When execve sees that the process maps to uid 0, it calls handle_privileged_root (https://code.woboq.org/linux/linux/security/commoncap.c.html) to restore all capabilities for the child shell process.

Now, show the uid_map in shell-B

## output uid_map from the shell of the new user namespace ##
root@ryan-ai:~# cat /proc/$BASHPID/uid_map
0 1000 1

The values are the same, 0 1000 1. This is in accordance with case 1: we are reading the uid_map file from within the same user namespace.

Start another tab, and show shell-B’s uid_map from the user namespace of shell-A

ryan@ryan-ai:~$ cat /proc/9985/uid_map 
0 1000 1

The output is in accordance with case 2: the reader's user namespace is different from the user namespace of the pid. shell-A's user namespace is the init user namespace in this case.

Now in shell-B, create another user namespace as the grandchild, and output the uid_map from shell-C

(base) root@ryan-ai:~# unshare -Ur
(base) root@ryan-ai:~# cat /proc/self/uid_map
0 0 1

This time, the output in shell-C is 0 0 1. The reason is that shell-B wrote 0 0 1 into the /proc/{shell-C-pid}/uid_map file. When writing into the uid_gid_extent struct of shell-C's user namespace, Linux internally converts the 0 into the global kernel uid 1000: uid_gid_extent.lower_first=1000.

Now, output the uid_map of shell-C from the namespace of shell-A. According to the algorithm, the output should again be 0 1000 1.

(base) ryan@ryan-ai:~$ cat /proc/9985/uid_map 
0 1000 1

How is the user namespace used in Docker?

I have one docker-compose file to run the postgres database.

version: '3'
services:
  database:
    image: "postgres" # use latest official postgres version
    container_name: qiusuo-postgres
    environment:
      POSTGRES_USER: qiusuo
      POSTGRES_PASSWORD: qiusuo
      POSTGRES_DB: qiusuo
    ports:
      - "5432:5432"
    volumes:
      - ../data:/var/lib/postgresql/data

By default, postgres runs as the postgres user, which is set up by a RUN command in the Dockerfile:

https://github.com/docker-library/postgres/blob/master/13/Dockerfile

# explicitly set user/group IDs
RUN set -eux; \
	groupadd -r postgres --gid=999; \
# https://salsa.debian.org/postgresql/postgresql-common/blob/997d842ee744687d99a2b2d95c1083a2615c79e8/debian/postgresql-common.postinst#L32-35
	useradd -r -g postgres --uid=999 --home-dir=/var/lib/postgresql --shell=/bin/bash postgres; \

Show the PID of the docker container and output the container process’s uid_map

ryan@ryan-ai:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
da6ec133bdbd postgres "docker-entrypoint.s…" 5 weeks ago Up 18 seconds 0.0.0.0:5432->5432/tcp qiusuo-postgres
ryan@ryan-ai:~$ docker inspect 2c78153ba924 | grep -i Pid
"Pid": 18559,
"PidMode": "",
"PidsLimit": null,
ryan@ryan-ai:~$ cat /proc/18559/uid_map
0 0 4294967295
ryan@ryan-ai: sudo ./lsns -p 18559
NS TYPE NPROCS PID USER COMMAND
4026531835 cgroup 361 1 root /sbin/init splash
4026531837 user 310 1 root /sbin/init splash
4026532620 mnt 7 18559 guest-cie0jn postgres
4026532621 uts 7 18559 guest-cie0jn postgres
4026532622 ipc 7 18559 guest-cie0jn postgres
4026532624 pid 7 18559 guest-cie0jn postgres
4026532626 net 7 18559 guest-cie0jn postgres

lsns shows that our Postgres container is actually using the init user namespace, while it has its own mnt, pid, and net namespaces.

How can we make the container process run as a non-root user on the host, while appearing as root inside the container?

We can make the docker daemon create another user namespace and write a specific mapping for the container process. There is a guide on the Docker page describing how to achieve this:

https://docs.docker.com/engine/security/userns-remap/

Net-Namespace

On Linux, we can create multiple virtual network devices, such as bridges and virtual ethernet pairs. The net namespace allows us to scope the visibility of network devices: when we add a network device to a namespace, it is only visible to processes inside that net-namespace. This gives processes the impression of running on a different host machine with its own ethernet cards.

We will not go deep into theory; experiments are more practical here.

//net namespace
struct net {
	struct list_head list; /* list of network namespaces */
	struct user_namespace *user_ns;
};

Goal: The goal is to set up one LAN on the local machine using virtual network devices.

Bridge: a bridge is a multi-port device that works on layer 2 of the OSI model. It forwards incoming frames from one port to another based on the destination MAC address of the frame.
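A learning bridge can be modeled in a few lines (a toy model, not the kernel implementation): it remembers which port each source MAC was seen on, and floods frames whose destination MAC it has not learned yet.

```python
# Toy model of a learning bridge (illustrative only).
# The bridge learns which port each source MAC lives on, then forwards
# frames by destination MAC, flooding when the destination is unknown.

class Bridge:
    def __init__(self, ports):
        self.ports = ports           # e.g. ["peer2", "peer4"]
        self.fdb = {}                # forwarding database: MAC -> port

    def receive(self, in_port, src_mac, dst_mac):
        self.fdb[src_mac] = in_port  # learn where src_mac is reachable
        if dst_mac in self.fdb:
            return [self.fdb[dst_mac]]                   # known: one port
        return [p for p in self.ports if p != in_port]   # unknown: flood

br = Bridge(["peer2", "peer4"])
print(br.receive("peer2", "aa:aa", "bb:bb"))  # ['peer4']  (flooded)
print(br.receive("peer4", "bb:bb", "aa:aa"))  # ['peer2']  (aa:aa learned)
print(br.receive("peer2", "aa:aa", "bb:bb"))  # ['peer4']  (bb:bb learned)
```

This is why, once both veth peers below are attached to the qcontainer bridge, the two namespace-side interfaces can reach each other at layer 2.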

We can use ip link to add a bridge on Linux, and show it

ryan@ryan-ai:~$ sudo ip link add name qcontainer type bridge
ryan@ryan-ai:~$ ip link
27: qcontainer: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 76:2c:a9:ba:2e:db brd ff:ff:ff:ff:ff:ff

Gateway: the gateway is the node that acts as the entry and exit point of a network; all traffic in and out of the network passes through it. At home we normally have a router, and the router is the gateway.

We can use ip r to print the IP address of the gateway(router)

ryan@ryan-ai:~$ ip r
default via 192.168.178.1 dev wlp10s0 proto static metric 600

Veth

Linux gives us tools to create virtual ethernet devices. They are similar to physical ethernet interfaces: we can assign IP addresses, bind to them, and configure IP rules on them. Veth devices come in pairs and behave like two ethernet interfaces connected by a cable: a packet received on one is directly forwarded to the other.
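The cable semantics can be modeled trivially (illustrative only; the classes are ours): whatever one endpoint transmits shows up in the peer's receive queue.

```python
# Toy model of a veth pair (illustrative only): two endpoints joined
# back-to-back; a frame transmitted on one is received on its peer.

from collections import deque

class VethEnd:
    def __init__(self, name):
        self.name = name
        self.peer = None
        self.rx_queue = deque()              # frames received on this end

    def transmit(self, frame):
        self.peer.rx_queue.append(frame)     # delivered directly to the peer

def veth_pair(name_a, name_b):
    a, b = VethEnd(name_a), VethEnd(name_b)
    a.peer, b.peer = b, a
    return a, b

peer1, peer2 = veth_pair("peer1", "peer2")
peer1.transmit("frame from inside the namespace")
print(peer2.rx_queue.popleft())   # frame from inside the namespace
```

Putting one end of such a pair inside a net namespace and leaving the other attached to a bridge is exactly the wiring we build below.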

We can use ip link to create two veth pairs and to display them.

ryan@ryan-ai:~$ sudo ip link add peer1 type veth peer name peer2
ryan@ryan-ai:~$ sudo ip link add peer3 type veth peer name peer4
ryan@ryan-ai: ip link
28: peer2@peer1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 0a:f7:83:14:74:df brd ff:ff:ff:ff:ff:ff
29: peer1@peer2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether fe:94:a8:74:4f:90 brd ff:ff:ff:ff:ff:ff
30: peer4@peer3: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 5a:d1:ec:dc:74:1b brd ff:ff:ff:ff:ff:ff
31: peer3@peer4: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 8e:45:0d:b1:2b:f8 brd ff:ff:ff:ff:ff:ff

In the ip link output, each veth interface name references its peer by appending the peer's name after @.

Now we can connect the peer2 and peer4 veth to the qcontainer bridge according to the previous picture.

ryan@ryan-ai:~$ sudo ip link set peer2 master qcontainer
ryan@ryan-ai:~$ sudo ip link set peer4 master qcontainer

Currently, ip link shows all the veth devices in the init namespace. We now put peer1 and peer3 into a new net namespace.

##create a new net namespace qcontainer##
ryan@ryan-ai:~$ sudo ip netns add qcontainer
##add peer1 and peer3 into the qcontainer net namespace##
ryan@ryan-ai:~$ sudo ip link set peer1 netns qcontainer
ryan@ryan-ai:~$ sudo ip link set peer3 netns qcontainer

Use ip link to show the interfaces again.

ryan@ryan-ai:~$ ip link
27: qcontainer: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 0a:f7:83:14:74:df brd ff:ff:ff:ff:ff:ff
28: peer2@if29: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master qcontainer state DOWN mode DEFAULT group default qlen 1000
link/ether 0a:f7:83:14:74:df brd ff:ff:ff:ff:ff:ff link-netnsid 1
30: peer4@if31: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master qcontainer state DOWN mode DEFAULT group default qlen 1000
link/ether 5a:d1:ec:dc:74:1b brd ff:ff:ff:ff:ff:ff link-netnsid 1

We can see that peer1 and peer3 have disappeared from the init net namespace.

Use ip link to show interfaces inside the net namespace.

ryan@ryan-ai:~$ sudo ip netns exec qcontainer ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
29: peer1@if28: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether fe:94:a8:74:4f:90 brd ff:ff:ff:ff:ff:ff link-netnsid 0
31: peer3@if30: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 8e:45:0d:b1:2b:f8 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Interestingly, the new net namespace comes with its own loopback interface.

Now we need to assign IP addresses to peer1, peer3, and the bridge.

ryan@ryan-ai:~$ sudo ip addr add 172.16.23.1/24 brd + dev qcontainer
ryan@ryan-ai:~$ sudo ip netns exec qcontainer ip addr add 172.16.23.23/24 dev peer1
ryan@ryan-ai:~$ sudo ip netns exec qcontainer ip addr add 172.16.23.24/24 dev peer3

We can show the IP address.

ryan@ryan-ai:~$ ip -4 addr
27: qcontainer: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
inet 172.16.23.1/24 brd 172.255.255.255 scope global qcontainer
valid_lft forever preferred_lft forever
### show ip addresses in the qcontainer net namespace ##
ryan@ryan-ai:~$ sudo ip netns exec qcontainer ip -4 addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
29: peer1@if28: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link-netnsid 0
inet 172.16.23.23/24 scope global peer1
valid_lft forever preferred_lft forever
31: peer3@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link-netnsid 0
inet 172.16.23.24/24 scope global peer3
valid_lft forever preferred_lft forever

Now we need to bring all the interfaces up.

ryan@ryan-ai:~$ sudo ip netns exec qcontainer ip link set lo up
ryan@ryan-ai:~$ sudo ip netns exec qcontainer ip link set peer1 up
ryan@ryan-ai:~$ sudo ip netns exec qcontainer ip link set peer3 up
ryan@ryan-ai:~$ sudo ip link set peer2 up
ryan@ryan-ai:~$ sudo ip link set peer4 up
ryan@ryan-ai:~$ sudo ip link set qcontainer up

After all this, we configure the qcontainer bridge as the gateway for the container network, and add a route in the init namespace directing packets for 172.16.23.0/24 to the qcontainer bridge.

ryan@ryan-ai:~$ sudo ip netns exec qcontainer ip route add default via 172.16.23.1
ryan@ryan-ai:~$ sudo ip route add 172.16.23.0/24 dev qcontainer
## print the ip routing table in qcontainer net-namespace ##
ryan@ryan-ai:~$ sudo ip netns exec qcontainer ip r
default via 172.16.23.1 dev peer1
172.0.0.0/8 dev peer1 proto kernel scope link src 172.16.23.23
172.0.0.0/8 dev peer3 proto kernel scope link src 172.16.23.24
## print the ip routing table in init net-namespace ##
(base) ryan@ryan-ai:~$ ip r
172.16.23.0/24 dev qcontainer proto kernel scope link src 172.16.23.1
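The route selection these tables rely on is longest-prefix match with a default route as fallback. A minimal sketch, with the two routes from this setup hard-coded (function and variable names are ours):

```python
# Sketch of longest-prefix-match route lookup (simplified model of what the
# kernel routing table does; illustrative only).

import ipaddress

routes = [                     # (prefix, next_hop_or_None, device)
    ("172.16.23.0/24", None, "qcontainer"),      # directly connected
    ("0.0.0.0/0", "192.168.178.1", "wlp10s0"),   # default via the gateway
]

def lookup(dst):
    """Return (next_hop, device) for the most specific matching route."""
    dst = ipaddress.ip_address(dst)
    best = None
    for prefix, via, dev in routes:
        net = ipaddress.ip_network(prefix)
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, via, dev)
    return best and (best[1], best[2])

print(lookup("172.16.23.23"))   # (None, 'qcontainer'): on-link via the bridge
print(lookup("8.8.8.8"))        # ('192.168.178.1', 'wlp10s0'): default route
```

A next hop of None means the destination is on-link: the packet is handed to the device directly, which is why pinging 172.16.23.23 below goes straight through the bridge.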

Let’s try to ping 172.16.23.23 and 172.16.23.24.

ryan@ryan-ai:~$ ping 172.16.23.23
PING 172.16.23.23 (172.16.23.23) 56(84) bytes of data.
64 bytes from 172.16.23.23: icmp_seq=1 ttl=64 time=0.042 ms
ryan@ryan-ai:~$ ping 172.16.23.24
PING 172.16.23.24 (172.16.23.24) 56(84) bytes of data.
64 bytes from 172.16.23.24: icmp_seq=1 ttl=64 time=0.074 ms

It rocks!

How does docker use the net namespace?

Let’s look at the container net-namespace.

ryan@ryan-ai:~$ docker inspect 2c78153ba924 | grep -i pid
"Pid": 7713,
"PidMode": "",
"PidsLimit": null,
ryan@ryan-ai:~/Software/utils-linux/source/util-linux$ sudo ./lsns -p 7713
NS TYPE NPROCS PID USER COMMAND
4026532795 net 7 7713 guest-cie0jn postgres
ryan@ryan-ai:~$ su -
root@ryan-ai:~# mkdir -p /var/run/netns
root@ryan-ai:~# ln -sfT /proc/7713/ns/net /var/run/netns/2c78153ba924
root@ryan-ai:~# ip netns exec 2c78153ba924 ip r
default via 172.19.0.1 dev eth0
172.19.0.0/16 dev eth0 proto kernel scope link src 172.19.0.2
root@ryan-ai:~# ip netns exec 2c78153ba924 ip -4 addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
18: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link-netnsid 0
inet 172.19.0.2/16 brd 172.19.255.255 scope global eth0
valid_lft forever preferred_lft forever

The above commands enter the net-namespace of the container and output its routing table and interface IP addresses. The interface eth0@if19 inside the net namespace has IP address 172.19.0.2/16.

The following command shows interface addresses in the init-net namespace.

root@ryan-ai:~# ip -4 addr
10: br-eb853960c372: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
inet 172.19.0.1/16 brd 172.19.255.255 scope global br-eb853960c372
valid_lft forever preferred_lft forever

There is actually a bridge with IP address 172.19.0.1/16.

If we print the iptables rules, we can see that Docker sets up rules that let packets destined for port 5432 reach the container through the bridge.

root@ryan-ai:~# sudo iptables -S  | grep 5432
-A DOCKER -d 172.19.0.2/32 ! -i br-eb853960c372 -o br-eb853960c372 -p tcp -m tcp --dport 5432 -j ACCEPT

Linux namespaces are the building blocks of container technology, and they are amazing.
