ECSE427/COMP310 (Version 1) Page 1 of 12
Programming Assignment #3: Simple Resource Container
Due date: Check My Courses
This assignment may be completed in groups of two (max). You are expected to split the workload
evenly within the group.
This assignment needs root privileges. We have created an OS playground on cs310.cs.mcgill.ca.
You can use that or your own machine. You are strongly advised to use a container such as
Docker or a virtual machine to develop this assignment; this way you can prevent any harmful
modifications to your Linux installation. We have provided a default Docker image (on
cs310.cs.mcgill.ca) which you can use to spawn a container. You will then enter this container and
try out everything that is listed below. See the APPENDIX for how to get into this playground
environment. READ the APPENDIX before you proceed.
This assignment will be done in two stages. In the first stage, you explore all the concepts
that go into a Simple Resource Container (SRContainer). You do this by building and
experimenting with an SRContainer without writing any code. We need to make a few simplifying
assumptions to make this work. In the second stage, you will implement the real SRContainer
using the template we provide. The knowledge you gain in this stage is necessary to complete
the second one. The template will allow you to complete the assignment with a minimal amount
of coding.
1. Overview
In this assignment you are expected to develop an SRContainer that runs in Linux. The
SRContainer is modelled after the highly popular Docker container format. It is implemented
using advanced features supported by Linux such as namespaces, control groups, system call
filtering, and fine-grained capability management.
A container looks like a virtual machine: it does what we expect from a machine from an application's
point of view. It is not a machine by itself; rather, it is part of another machine, the host. The basis for
container creation is the clone() system call that was already discussed in Assignment 1. A container is
simply a process spawned within the host with proper isolation (using clone() flags). The container has its
own resources or partitions of the resources: its own file system, network configuration, memory and
CPU slices. For example, the file system could be specific to the container. A container running on Ubuntu
could have an Alpine Linux file system. Because containers share the kernel instance running in
the host, you cannot have a file system from an incompatible (non-Linux) operating system. For instance,
a FreeBSD file system would not work for a container inside a host running a Linux kernel.
In a Unix-based operating system, we already have a strong isolation scheme between processes in
terms of fault containment: one process can keep running while another crashes.
From a security point of view, however, the isolation between processes is quite weak. This has
motivated research into sandboxing, where a group of processes can be isolated from the host or from
another group of processes. The web browser is a good example of an application that uses process sandboxing
to shield the host and the user's data from untrusted code downloaded from remote sites and run within the
browser.
For the purposes of this assignment we consider the container as follows:
Container = Processes + Isolation + File System Image + Resource Principal + Host-Kernel
The major purpose of this assignment is to learn the different ways isolation can be added to processes
to create containers. We also see how the newly created container can be made to run a file system that is
different from the host's file system. You are expected to use this assignment to explore the design space for
containers in a Linux-like operating system. The knowledge from this first stage will allow you to complete
the “from-scratch-containers” skeleton code in the second stage.
Another major feature of containers is process independent resource monitoring, accounting, and
control. Using this feature, we can package an application inside a container and use it to monitor and
control the resource usage on a per-application basis. This is very helpful with micro-services (a popular
software engineering paradigm) when they are hosted in clouds. With an application focused resource usage
control framework, we can precisely manage the resource consumption levels.
2. Exploring Namespaces
Processes are built on the idea of address spaces. Each process has its own address space that prevents
a process from interfering with another process. The container idea is based on a similar concept called
namespaces. When you start the operating system, all the processes go into the same namespaces. In Linux,
we have 7 namespaces: PID, IPC, UTS, NET, USER, CGROUP, and MNT. So, without containers, all
processes would go into the same instance of the PID namespace, the same instance of the IPC namespace, and
so on. The first step is to verify this yourself.
Go to the /proc interface. This interface allows you to see the current kernel state and even control it.
Change into this directory and explore it. You will see a lot of information. Each process has its own
folder named after its process ID (PID). Change into the folder self; it points to the process of the shell
that you are using to explore the /proc interface. You can confirm that by looking at the cmdline
information available in the folder. Go to the ns folder to look at the namespace information. You will see
something like the following there. It shows the information for each type of namespace for the shell.
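On a typical Linux system, the listing looks something like this (a sketch: the bracketed inode numbers will differ on your machine, and newer kernels may show a few additional entries):

```shell
ls -l /proc/self/ns
# lrwxrwxrwx 1 user user 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 user user 0 ... ipc -> 'ipc:[4026531839]'
# lrwxrwxrwx 1 user user 0 ... mnt -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 user user 0 ... net -> 'net:[4026531992]'
# lrwxrwxrwx 1 user user 0 ... pid -> 'pid:[4026531836]'
# lrwxrwxrwx 1 user user 0 ... user -> 'user:[4026531837]'
# lrwxrwxrwx 1 user user 0 ... uts -> 'uts:[4026531838]'
```

Two processes are in the same namespace exactly when these links point to the same inode number, so comparing readlink /proc/1/ns/pid (with sudo) against readlink /proc/self/ns/pid is a quick way to check whether a process shares init's namespaces.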
You can explore different processes running in the system including the init (or systemd) process
(process ID 1) that bootstrapped the system. Check the namespace information for the different processes
and verify that most of them belong to the same namespace.
Print out the process tree using the following command.
pstree -g -s
A partial output of the process tree printed by the previous command is shown below.
You will notice that the host is running a whole lot of processes. The process tree shows all the
processes running in the host. This includes even the processes that are inside containers, because they
are also running on the host.
Let’s start experimenting with namespaces, which allow us to put processes in different
namespaces. We begin the experimentation using the unshare command. The unshare command (a
system call that also has a shell command wrapper) allows you to run processes in namespaces disassociated
from the parent's namespaces. You can unshare from any or all of the parent’s namespaces. Let’s use it to run a
shell (/bin/sh to start with) in a different namespace. That is, we want to create a namespace
and put the shell in it. The unshare command does that, and you can select the namespace type using
one of its options. We will put the shell in its own PID namespace (the -p option).
sudo unshare -fp /bin/sh
Note: The -f option tells unshare to run the program /bin/sh as a sub-process via fork(). If not specified,
the unshare process itself will be replaced by /bin/sh, which can cause issues. Read the unshare man page for more clarity.
Run the pstree command previously shown in the shell. Observe the processes that are shown in the
output. They are not much different from the previous list, because the pstree command is getting the process
listing information from the /proc file system interface. So, we are seeing all the processes instead of
confining the output to the newly created namespace that is holding the shell process. To confine the process
listing to the ones in the newly created namespace, we need to mount the proc file system again. This is
done in the command below.
sudo unshare -fp --mount-proc=/proc /bin/bash
Run the pstree command in the shell and note the processes that are listed. You will see an
output like the following. It is interesting to note that instead of the systemd (init) process, we have the bash
program as the originating process.
Using the unshare command you can start isolated process groups. For instance, in the above
command you launched the bash shell. Now, using the bash shell you can run any program, which can
potentially spawn any number of child processes. All of those processes will inherit the new namespace
that we just created using the unshare command. Processes that were put into their own namespaces with
unshare are what we call containers. You can run multiple processes inside this new namespace, which
means multiple processes running inside the container. Namespaces provide the building blocks for
containers.
Start two different bash shells using the previous command. These two bash shells will be in two
different PID namespaces. Run some arbitrary programs in the two different shells. Once you are
comfortable with the shells, do the following experiments.
1. Run arbitrary programs (e.g., “ping 8.8.8.8”, “tr ABC 123”) in the two different shells. Do you
see the processes that you run in one shell from the other shell?
2. Do you see the programs you run in the shells from the host?
3. Can you kill the programs that you ran in the shells from outside (i.e., outside the shells but in the
same host)? To do this you need another terminal. If you are doing this experiment in a
Docker container, you need to run docker exec to get into the container, because the Docker
container is your host.
4. Launch some programs in the host. Do you see those processes inside the shells?
(Use htop -u <SOCS_USERNAME> to list processes run by you.)
In the above experiment, you used the PID namespace. The general idea of the PID namespace is to reuse
the same PID values in the different namespaces. For instance, the host would have PID values starting
from 1, where 1 is the init (systemd) process. A child namespace would also have its own PID 1. The
container system designer is free to select the program that will actually be PID 1 inside each
namespace. Once the process with PID 1 inside a namespace (container) dies, the container ceases to exist,
just as the system shuts down when init goes down. There are many advantages to PID reuse in the
different namespaces. One of them is the ability to move a set of processes from one machine to another.
For more information about PID namespaces, type the following command: man pid_namespaces
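You can verify the PID-1 behaviour directly. A small sketch (the second variant assumes your kernel allows unprivileged user namespaces):

```shell
# With -f, the forked child (here /bin/sh) becomes PID 1 of the new namespace:
sudo unshare -fp --mount-proc=/proc /bin/sh -c 'echo $$'
# -> 1

# The same effect without sudo, by also unsharing a USER namespace
# (-U creates it, -r maps you to root inside it):
unshare -Urpf /bin/sh -c 'echo $$'
# -> 1
```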
Now, let’s turn our attention to another namespace: USER. The objective of the USER namespace is
similar to that of the PID namespace: we want to be able to reuse the same user ID and group ID values in
different namespaces. For example, UID 0 and
GID 0 are associated with the root user and root group, respectively. The screenshot below shows the
entries from a password file (cat /etc/passwd). For more information on USER namespaces, consult the
man page: man user_namespaces
Let’s simply launch a shell with a detached USER namespace, like the following. This command is
going to put the shell in namespaces different from the parent's for both PID and USER.
sudo unshare -fpU --mount-proc=/proc /bin/bash
Note: it's an uppercase “U”. Lowercase “u” is for the UTS namespace.
Immediately, you will notice a problem. The user running the bash has changed from the previous
run, which had the same user mapping across the host (parent namespace) and the “container” (child
namespace). When you run without detaching the USER namespace, like the following:
sudo unshare -fp --mount-proc=/proc /bin/bash
Note: type “id” to check who the user is inside the container/unshared-namespace.
the user in the shell in the new namespace is actually root! That is, there is a seeming escalation of
privileges (i.e., a normal user in the host or the parent namespace went to administrator level in the new
namespace). On closer inspection, it turns out that there is no privilege-escalation issue, because the
unshare was run as the administrator: although the user was a non-administrator, the sudo command
was used to escalate the privileges.
With the detached USER namespace, you will notice that the shell is running with a different prompt
than it was when you had just the PID namespace. Type id in the shell to find out the user ID (UID) and group ID
(GID) of the shell process. You will notice that the shell is running with nobody as the user. nobody is a
standard user with the least privileges, just the opposite of root! See the screenshot below; it is the last
entry there.
Previously, when we ran the shell in the following way, we were running the shell as the root (most
privileged user).
sudo unshare -fp --mount-proc=/proc /bin/bash
So, when we detached the USER namespace, we just created a USER namespace and put the process
(shell) in that namespace. This is not sufficient. We need to provide a UID and GID mapping that
relates the IDs in that namespace to the ones in the host USER namespace.
Run a command such as adduser, which needs higher privileges, in the shell with least privileges. You
will notice that the command does not work due to lack of privileges. The same command would have
progressed much further with its execution when the shell was running without a separate USER namespace
(i.e., when run as root).
Let’s fix this problem. We need to specify a UID and GID mapping.
Open another terminal in the parent namespace. Run the following command.
ps axu | grep bash
You will see the unshare ... /bin/bash command there. Note down the PID of the process.
sudo newuidmap PID 0 0 1
This installs a mapping between the root user (0) in the child namespace and the root user (0) in the
parent namespace. Similarly, for the group:
sudo newgidmap PID 0 0 1
Now run id again in the shell terminal. You will see that we have changed the user identity. It is
not nobody anymore; it is back to root. The privilege level you get changes according to the user ID
mapping.
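The mappings installed by newuidmap and newgidmap can be inspected directly: they live in /proc/<PID>/uid_map and /proc/<PID>/gid_map, and each line has the form <first ID inside> <first ID outside> <count>. As a sketch, the initial (host) namespace carries the identity mapping:

```shell
cat /proc/self/uid_map
#          0          0 4294967295
```

After the newuidmap PID 0 0 1 command above, /proc/PID/uid_map (read from the parent namespace) should show the single-entry map 0 0 1: UID 0 inside the container corresponds to UID 0 on the host, for a range of one ID.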
Lastly, let’s examine the UTS namespace. UTS namespaces provide isolation of two system identifiers: the
hostname and the NIS domain name. Let’s consider the hostname here.
Run the following command to create a child namespace that is isolated in UTS from the parent namespace.
sudo unshare -fpu --mount-proc=/proc /bin/bash
You can change the hostname in the parent namespace using
hostname xyxy
Observe whether the hostname in the shell (child namespace) changes or not. You can type hostname
to print out the hostname. Do the same experiment without an isolated UTS namespace and see what happens.
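As a sketch, the experiment looks like this (the hostname value is an example):

```shell
# Terminal 1: shell in its own UTS (and PID) namespace
sudo unshare -fpu --mount-proc=/proc /bin/bash
hostname              # initially a copy of the host's name
hostname container-a  # change it inside the namespace
hostname              # -> container-a

# Terminal 2: on the host, the name is unaffected
hostname
```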
3. Exploring Chroot
So far, we have managed to put the shell process into different PID, USER and UTS namespaces.
Despite those new namespaces, we could not provide the virtual machine illusion. The main reason is that
the shell was reusing the same file system as the underlying host, so we could see the files of the
host from within the container/namespace. We need to isolate the file system and give the shell its own
file system.
To give the shell its own file system, we need a root file system. We can get one in
two different ways: create one using debootstrap, or download a root file system from a repository.
Using debootstrap you can do the following.
debootstrap jessie jrootfs (gets a minimalistic file system from Debian jessie)
debootstrap stretch srootfs (gets a minimalistic file system from Debian stretch)
The first command creates a root file system in the jrootfs folder that is based on Debian jessie. You can use
this in your shell as the file system. Change into the directory after having done the other isolations as
discussed previously.
cd jrootfs
sudo unshare -fpu --mount-proc=/proc /bin/bash
chroot .
This gives you a file system that is isolated from the host. Although your root file system is a folder
in the host, it gives you an isolated feel because you are using different files: binaries, libraries, configs, etc.
Right away, you will run into some problems. For instance, process-related commands like ps and top
will not work. You need to mount the proc file system using a command like
mount -t proc proc /proc
Instead of using debootstrap, which is good for Debian root file systems, you could download other Linux
root file systems for use here. One of them is Alpine Linux (a very lightweight Linux based on BusyBox).
Below is the URL for downloading its root file system. Unpack it and use it just like the previous one.
http://dl-cdn.alpinelinux.org/alpine/v3.8/releases/x86_64/alpine-minirootfs-3.8.1-x86_64.tar.gz
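Putting the Alpine steps together, a sketch (the file name matches the URL above, arootfs is an arbitrary folder name, and chroot is given /bin/sh explicitly because the Alpine image ships BusyBox sh rather than bash):

```shell
wget http://dl-cdn.alpinelinux.org/alpine/v3.8/releases/x86_64/alpine-minirootfs-3.8.1-x86_64.tar.gz
mkdir arootfs
tar -xzf alpine-minirootfs-3.8.1-x86_64.tar.gz -C arootfs
cd arootfs
sudo unshare -fpu --mount-proc=/proc /bin/bash
chroot . /bin/sh
mount -t proc proc /proc    # inside the chroot, so ps and top work again
```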
Once you do the change root, you get a VM-like feel. However, we have also created another problem:
we lost association with an important folder. For the next step (controlling resource allocations), we are going to
use control groups (cgroups), which are accessed through the /sys/fs/cgroup folder. This folder is not
accessible after the chroot. In theory, we should be able to inject this folder from the host into the shell after
the shell is launched; however, that did not work in our tests. So, here is what is known to work.
You recursively bind mount the /sys/fs/cgroup folder onto the root file system before you launch
the shell. Then, you start the shell as described above. You should see the cgroup folder from your “vm” at
/sys/fs/cgroup and it should be functional!
To recursively bind mount, use the following command.
mount --rbind /sys/fs/cgroup $ROOTFS/sys/fs/cgroup
$ROOTFS is the path to where your root file system is (e.g., the jrootfs created above).
You must create the folders “fs” and “cgroup” under “/sys” inside your container.
Also, you should run the above command in the host/playground container.
4. Exploring Control Groups
We have looked at isolation and file system for the SRContainer in the previous sections. In this section,
we are going to see how we can do resource monitoring and control. Let’s say we want the
SRContainer instance to kill an application if the application consumes too much memory. Before we
see how that can be done, we show you an example memory hog. It just keeps eating memory, and if we
have an out-of-memory (OOM) killer, the process should be killed after it has run for a while.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main()
{
    int i;
    long sz = 100000;

    mlockall(MCL_FUTURE | MCL_ONFAULT); // lock virtual memory so it cannot be swapped to disk

    for (i = 0; i < 10000; i++)
    {
        printf("Allocating block %d \n", i);
        fflush(stdout);
        char *p = malloc(sz);
        if (p == NULL) {
            printf("Allocation error.. \n");
            fflush(stdout);
            continue; // do not touch a failed allocation
        }
        printf("Success.. \n");
        fflush(stdout);
        memset(p, 0, sz); // malloc'ed memory must be touched to actually be charged
        usleep(10000);
    }
    return EXIT_SUCCESS;
}
First, we create the executable from the above code. Second, we go into /sys/fs/cgroup. You will see the
memory controller. Do the following commands:
cd memory
mkdir ExMemLimiter
cd ExMemLimiter
At this point, you will see that the newly created directory is already populated with the controller's files!
Set the limit by using the following command.
echo <integer> > memory.limit_in_bytes
Next, we must add our container to this cgroup (controller). We do this by writing the PID of our
container to the file called “tasks” listed above. You need to be in the
/sys/fs/cgroup/memory/ExMemLimiter folder.
echo <PID> > tasks
Here <PID> is the PID of the container (the unshared /bin/bash) that you noted down earlier.
Now when you run the memory hogging program given above inside your container, it must be killed upon
reaching the limit you set above. Any program that runs inside the container is governed by the memory
controller. It is important to note that we did not associate the memory hogging program directly. The
memory hogging program was restricted by the controller attached to the container.
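While the hog is running, you can watch the controller from the host; a sketch, assuming the cgroup-v1 hierarchy used above and the ExMemLimiter group you created:

```shell
cd /sys/fs/cgroup/memory/ExMemLimiter
cat memory.limit_in_bytes   # the limit you set
cat memory.usage_in_bytes   # current usage of every task in this cgroup
cat memory.failcnt          # how many times usage has hit the limit
dmesg | tail                # the OOM killer logs its kills here
```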
Another resource you can restrict per container is CPU usage. To control this, you need to use two
cgroup controllers: cpu and cpuset. Traverse into the “/sys/fs/cgroup” folder and do the
following:
cd cpu
mkdir CPULimiter
cd CPULimiter
You will see the following cgroup metrics enumerated. You have to make changes to the “cpu.shares” file:
echo <integer> > cpu.shares
Now, this control on the CPU is relative: it sets the ratio of CPU time allocated to
processes of this cgroup with respect to others. So, to validate this we must have at least two containers
running, each of them attached to a different cpu controller. So, you must run a second unshared-container
instance and connect it to a new cpu controller; let's call it the CPULimiter2 cgroup. You must enter a value
ten times larger into the new cpu.shares file (for example, if you wrote 100 above, write 1000 here).
echo <10 x integer> > cpu.shares
Notice that we now have ten times the CPU shares set. Now when you run similar loads on these two instances
of unshared containers (each associated with a different controller), you should see the CPU allocation split 1:10
between them. We have to take care of one more thing here: we have to ensure that the containers are
restricted to one CPU core. If not, the loads may simply run on separate cores, each at 100%. We can restrict them to
one core using the cpuset cgroup controller. Now traverse to the cpuset folder under
“/sys/fs/cgroup” and create a new cgroup:
mkdir CPURestrictor
Change the cpuset.cpus file under this directory as follows:
echo 0 > cpuset.cpus
Change the cpuset.mems file under this directory as follows:
echo 0-1 > cpuset.mems
This will restrict both containers to CPU core 0, with two memory nodes available. (If your machine has
only one memory node, use echo 0 > cpuset.mems instead.)
Now add one container's PID to the tasks file under CPULimiter and the other's to the tasks file within
CPULimiter2. Also add both (container) PIDs to the tasks file under CPURestrictor. CPULimiter
sets the CPU limit on the first container instance and CPULimiter2 sets the CPU limit on the second.
CPURestrictor restricts both containers to run on a single core, so they have to share it based on their
weights (1:10).
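The whole CPU-control setup can be summarized in one configuration sketch (run as root on the host; 100 and 1000 are example share values giving the 1:10 ratio, and PID1, PID2 stand for the two containers' PIDs):

```shell
cd /sys/fs/cgroup/cpu
mkdir CPULimiter CPULimiter2
echo 100  > CPULimiter/cpu.shares      # weight 1
echo 1000 > CPULimiter2/cpu.shares     # weight 10

cd /sys/fs/cgroup/cpuset
mkdir CPURestrictor
echo 0   > CPURestrictor/cpuset.cpus   # pin to core 0
echo 0-1 > CPURestrictor/cpuset.mems   # memory nodes (use just 0 on single-node machines)

echo PID1 > /sys/fs/cgroup/cpu/CPULimiter/tasks
echo PID2 > /sys/fs/cgroup/cpu/CPULimiter2/tasks
echo PID1 > /sys/fs/cgroup/cpuset/CPURestrictor/tasks
echo PID2 > /sys/fs/cgroup/cpuset/CPURestrictor/tasks
```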
Finally, run the following program in both unshared containers:
stress --cpu 1 -v
(you have to install stress in your rootfs)
Now when you run htop or top in the host container to observe, the CPU should be split 1:10 between the
two containers, as shown below.
We leave controlling other resources, such as block-IO and network, as practice for you to try. We will
ask about this in the viva session for grading this assignment.
5. Testing and Evaluation
• Change hostname inside the container and show that this does not affect the host’s hostname
• Create two containers with different UID/GID mappings and explain how these mappings are
understood with respect to each namespace
• Show cgroup control over memory
• Show CPU-sharing between multiple containers
• Show Block-IO read restriction on a container
• Show Block-IO write restriction on a container
APPENDIX:
We have set up a dedicated server (cs310.cs.mcgill.ca) for this entire assignment. Since
we will be playing with some of the core kernel features, you will need “root” access to run
most of these commands. And since it is not possible to provide all students with “root”
access to this server, we have created a Docker image that can be used by everyone.
Docker is simply a tool-set that allows you to quickly get containers up and running. A
Docker image is basically a template for the container that will be spawned. So, we have
created an image that will give you a complete (container) environment within which you can
do whatever you want. You will have root (privileged) access inside this container. You can
do whatever you want inside this container without affecting the (cs310.cs.mcgill.ca)
server. So, we request all students to spawn their own container instance using this image.
Then, you can get into this running instance and try out all that is explained above. You can
see what it would look like in the figure below. Here, C1, C2 and C3 would be the containers
you will be creating inside the isolated (root’ed) container environment. Please note that
whenever the description above uses the term “host” or “host-container”, it means the
container instance from within which you will be working (e.g., Shabirmean’s container). When
it says just container or unshared-container, it denotes one of the containers you
created (i.e., C1, C2 or C3).
The name of the image made available to you is: smean_image_comp310
You can create your own running instance of a container using this image by issuing the
following command:
docker run --privileged -d --rm --name=<MEANINGFUL_NAME> smean_image_comp310
Replace <MEANINGFUL_NAME> with a proper name so you know which container is your own.
Once you spawn a container as above, it will keep running. Now you can get into this
container by issuing the following command:
docker exec -it <MEANINGFUL_NAME> /bin/bash
Once you issue this command, you will be inside one of the blue circles shown
above; that is, you will be inside your own container instance. Now you can freely start
experimenting with all the commands listed above. You will be root and can do anything
inside here. You can open another terminal, ssh into the cs310 server, and issue the above
command again to get another terminal instance inside your container. Likewise, you may
enter this container from as many terminals as you want by issuing the above command.
Useful commands:
kill -9 <PID> – Kill a process
htop -u <SOCS_USERNAME> – To list processes run by you
