March 23, 2016
This article was contributed by Rami Rosen
After many years, the Linux kernel’s control group (cgroup)
infrastructure is undergoing a rewrite that changes the API in a
number of places. Understanding the changes is important to
developers, particularly those working on containerization
projects. This article looks at the new features of cgroups v2, which was
recently declared production-ready in kernel 4.5. It is based on a
talk I gave at the recent Netdev 1.1 conference in
Seville, Spain. The video [YouTube] for that talk is now available online.
The cgroup subsystem and its associated controllers handle
management and accounting of various system resources like CPU,
memory, I/O, and more. Together
with the Linux namespace subsystem, which is a bit older
(dating from around 2002) and is considered a bit more mature (apart, perhaps, from
user namespaces, which still raise discussions), these subsystems form the basis of
Linux containers. Currently, most projects involving Linux containers, like
Docker, LXC, OpenVZ, Kubernetes, and others, are based on both of them.
The development of the Linux cgroup subsystem started in 2006 at
Google, led primarily by Rohit Seth and Paul Menage. Initially,
the project was called “Process Containers”, but the name was later
changed to “Control Groups” to avoid confusion with Linux containers;
nowadays everybody calls them “cgroups” for short.
There are currently 12 cgroup controllers in cgroups v1; all but one have
existed for several years. The new addition is the PIDs controller,
developed by Aleksa Sarai and merged
in kernel 4.3. It allows limiting the number of
processes created inside a control group, so it can be used as an
anti-fork-bomb solution. The PID space in Linux consists of, at a
maximum, about four million PIDs (PID_MAX_LIMIT). Given today’s RAM capacities, this
limit could easily and quite quickly be exhausted by a fork bomb from
within a single container. The PIDs controller is supported by both cgroups v1 and cgroups v2.
Over the years, there has been a lot of criticism of the
implementation of cgroups, which seems to present a number of
inconsistencies and a lot of chaos. For example, when creating
subgroups (cgroups within cgroups), several cgroup controllers propagate parameters to their immediate subgroups, while other
controllers do not. Or, for a different example, some controllers use
interface files (such as the cpuset controller’s clone_children)
that appear in all controllers even though they only affect one.
As maintainer Tejun Heo himself has
admitted [YouTube], “design followed implementation”, “different decisions were
taken for different controllers”, and “sometimes too much flexibility
causes a hindrance”. In an LWN article
from 2012, it was said
that “control groups are one of those features that kernel developers
love to hate.”
The cgroups v2 interface was declared non-experimental in
kernel 4.5. However, the cgroups v1 subsystem was not removed
from the kernel, so, after the system boots, both cgroups v1 and cgroups v2 are enabled by default. You can use a mixture of both, although
you cannot use the same type of controller in both cgroups v1
and cgroups v2 at the same time.
It is worth mentioning that there is a patch that adds a kernel
command-line option for disabling cgroups v1 controllers (cgroup_no_v1),
which was merged for kernel 4.6.
Kernel support for cgroups v1 will probably still exist for at least
several more years, as long as there are user-space applications that
use it, much like what we had in the past with iptables and
ipchains, and what we see now with iptables and nftables. Some
user-space applications have already started migrating to
cgroups v2; for example, systemd and CGManager.
Both versions of cgroups are controlled through a synthetic
filesystem that gets mounted by the user. During the last three years or so, a special mount option was
available in cgroups v1 (__DEVEL__sane_behavior). This mount option
enabled the use of certain experimental features, some of which formed the
basis of cgroups v2 (the option was removed in kernel 4.5, however). For example, using this mount option forces
the use of the unified hierarchy mode, in which controller
management is handled similarly to the way it is done in cgroups v2.
The __DEVEL__sane_behavior mount option is mutually exclusive
with the mount options that were removed in cgroups v2, like noprefix, clone_children, release_agent, and more.
Systemd started to use cgroups for service management, rather than
for resource management, many years
ago. Each systemd service is mapped to a separate control group. However, the
migration of systemd to cgroups v2 is still partial, as it uses
the __DEVEL__sane_behavior mount option. Also, in CGManager, current support for cgroups v2 is partial: it is provided only when using Upstart, and not when
using systemd.
Currently, three cgroup controllers are available in cgroups v2:
I/O, memory, and PIDs. There are already patches and discussions on
the cgroups mailing list about adding the CPU controller as well.
There are also interesting patches adding support for resource
groups, posted just last week by Heo.
In cgroups v1, you could assign threads of the same process to different cgroups, but this is not possible in cgroups v2.
As a result, in-process resource-management abilities, like the
ability to control CPU cycle distribution hierarchically between the
threads of a process, are missing, as all the threads belong to a
single cgroup. With the proposed resource groups (rgroups)
infrastructure, this ability could be implemented as a natural extension of the setpriority() system call.
Details of the cgroups v2 interface
Mounting cgroups v2 is
done as follows:
mount -t cgroup2 none $MOUNT_POINT
Note that the type argument (following -t) has
changed; cgroups v1 used -t cgroup. As in cgroups v1, the mount point can be anywhere in the
filesystem. But, in contrast, there are no mount options at all
in cgroups v2. One could use mount options to enable controllers in cgroups v1, but in
cgroups v2 this is done differently, as we will see below. Creation of new subgroups in cgroups v2 is done with
mkdir groupName, and removal is done with rmdir groupName.
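As a minimal sketch of the steps above, the shell helpers below check whether the running kernel lists the cgroup2 filesystem type and wrap the mkdir and rmdir calls. The function names and the optional file argument are assumptions made for illustration only; /proc/filesystems is the usual list of filesystem types known to the kernel.

```shell
# cgroup2_supported [FILE]: succeed if the filesystem list names cgroup2.
# FILE defaults to /proc/filesystems; the override is only an
# illustrative hook so the check can be tried on a sample file.
cgroup2_supported() {
    # Lines look like "nodev<TAB>cgroup2"; the type is the last field.
    awk '$NF == "cgroup2" { found = 1 } END { exit !found }' \
        "${1:-/proc/filesystems}"
}

# create_subgroup MOUNT_POINT NAME and remove_subgroup MOUNT_POINT NAME:
# hypothetical wrappers around the plain mkdir and rmdir calls.
create_subgroup() { mkdir "$1/$2"; }
remove_subgroup() { rmdir "$1/$2"; }
```

With cgroups v2 mounted on $MOUNT_POINT, `cgroup2_supported && create_subgroup "$MOUNT_POINT" groupName` would create the group. On a real hierarchy, removal only succeeds once the group has no attached processes and no child groups.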
After mounting cgroups v2, a cgroup root object is created, with
three cgroup core interface files beneath it. For example, if
cgroups v2 is mounted on /sys/fs/cgroup2, the following
files are created under that directory:
- cgroup.controllers – This shows the supported cgroup controllers. All v2
controllers not bound to a v1 hierarchy are automatically
bound to the v2 hierarchy, and show up in cgroup.controllers of the
cgroup root object.
- cgroup.procs – When the cgroup filesystem is first
mounted, cgroup.procs in the root cgroup contains
the list of PIDs of all processes in the system, excluding
zombie processes. For each newly created subgroup, cgroup.procs is
empty, as no process is attached to the newly created group. Attaching
a process to a subgroup is done by writing its PID into the subgroup’s cgroup.procs file.
- cgroup.subtree_control – This holds the controllers that are
enabled for the immediate subgroups. This entry is empty just after mount, as no controllers
are enabled by default. Enabling and disabling controllers in the
immediate subgroups of a parent is done only by writing into its
cgroup.subtree_control file. So, for example, enabling the memory
controller is done by:
echo "+memory" > /sys/fs/cgroup2/cgroup.subtree_control
and disabling it is done by:
echo "-memory" > /sys/fs/cgroup2/cgroup.subtree_control
You can enable or disable more than one controller in the same command line.
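To illustrate combining several controllers in one command, here is a small sketch. The helper name set_controllers is hypothetical; it simply joins its arguments with spaces and performs a single write to cgroup.subtree_control.

```shell
# set_controllers CGROUP_DIR TOKEN...
# Each TOKEN is "+name" to enable or "-name" to disable a controller.
# All tokens go into ONE write, mirroring a single echo command.
set_controllers() {
    cg_dir="$1"
    shift
    # "$*" expands to the remaining arguments joined by spaces,
    # for example "+memory +pids".
    printf '%s\n' "$*" > "$cg_dir/cgroup.subtree_control"
}
```

For example, `set_controllers /sys/fs/cgroup2 +memory +pids` would be equivalent to `echo "+memory +pids" > /sys/fs/cgroup2/cgroup.subtree_control`. On a real hierarchy, reading the file back lists the enabled controller names without the + or - prefixes.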
These three cgroup core interface files are also created for each
newly created subgroup. In addition to these three files, a fourth cgroup core
interface file called cgroup.events is created. This interface file is
unique to non-root subgroups.
The cgroup.events file reflects whether any processes are attached to
the subgroup, and consists of one item, “populated: value”. The value
is 0 when there are no processes attached to that subgroup or its
descendants, and 1 when there are one or more processes attached to
that subgroup or its descendants.
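A short sketch of reading this flag from a script follows. The helper name is_populated is hypothetical, and because the exact separator in the file is an assumption here, the pattern accepts both “populated 1” and “populated: 1”.

```shell
# is_populated CGROUP_DIR: succeed (exit status 0) if the populated
# entry in CGROUP_DIR/cgroup.events has the value 1.
is_populated() {
    # Optional colon after the key, then whitespace, then the value.
    grep -Eq '^populated:?[[:space:]]+1$' "$1/cgroup.events"
}
```

A caller could then write something like `if is_populated /cgroup2/group1; then echo busy; else echo empty; fi`, where the path is only an example.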
As mentioned, subgroup creation is similar to the way it is done in cgroups v1.
But in cgroups v2, you can only create subgroups in a single
hierarchy, under the cgroups v2 mount point. When a new subgroup is created, the
value of the “populated” entry in cgroup.events is 0, as you would
expect, since there is no process yet attached to this newly created subgroup.
You can monitor events in this subgroup by calling poll(),
inotify(), or dnotify() from user space. Thus, you
can be notified when those files change, which can be used to
determine when the last process attached to a subgroup terminates or when the first
process is attached to that subgroup. This mechanism is much more
efficient than the parallel mechanism in
cgroups v1, the release agent.
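From a shell script, one way to approximate this is sketched below. It assumes the inotifywait tool from the inotify-tools package is installed and falls back to plain polling when it is not; the helper name wait_until_empty is hypothetical, and a C program would more likely call poll() or the inotify API directly.

```shell
# wait_until_empty CGROUP_DIR: return once CGROUP_DIR/cgroup.events
# no longer reports a populated value of 1.
wait_until_empty() {
    events="$1/cgroup.events"
    # Tolerate "populated 1" as well as "populated: 1".
    while grep -Eq '^populated:?[[:space:]]+1$' "$events"; do
        if command -v inotifywait >/dev/null 2>&1; then
            # Block until the file is modified. The 1-second timeout
            # covers a change landing between grep and inotifywait.
            inotifywait -qq -t 1 -e modify "$events" || true
        else
            sleep 1    # fallback: plain polling, less efficient
        fi
    done
}
```

A supervisor script could call `wait_until_empty /cgroup2/group1` (an example path) and then remove the now-empty group with rmdir.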
It is worth mentioning that this notification
mechanism can also be used by controller-specific interface
files. For example, the cgroups v2 memory controller has an interface
file called memory.events, which enables monitoring memory
events like out-of-memory (OOM) in a similar way.
When a new subgroup is created, controller-specific files are
created for each enabled controller in this subgroup. For example,
when the PIDs controller is enabled, two interface files are created:
pids.max, for setting a limit on the number of
processes forked in that subgroup, and pids.current, for accounting of the number of
processes in that subgroup.
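The following sketch shows how these two files might be used together from a script. The helper names are hypothetical; the only assumption about file contents is that pids.max holds either a number or the string “max” (meaning no limit) and that pids.current holds a number.

```shell
# set_pids_limit CGROUP_DIR N: cap the number of processes in the group.
set_pids_limit() {
    printf '%s\n' "$2" > "$1/pids.max"
}

# pids_headroom CGROUP_DIR: print how many more processes fit under the
# limit, or "unlimited" when pids.max contains "max".
pids_headroom() {
    max=$(cat "$1/pids.max")
    cur=$(cat "$1/pids.current")
    if [ "$max" = "max" ]; then
        echo unlimited
    else
        echo $((max - cur))
    fi
}
```

For instance, `set_pids_limit /cgroup2/group1 100` followed by `pids_headroom /cgroup2/group1` would report how far the group is from its limit; the path is illustrative.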
Let’s take a look at two diagrams illustrating what we just
described. The following sequence mounts cgroups v2 on /cgroup2,
creates a subgroup called “group1”, creates two subgroups of group1 (“nested1” and “nested2”), then enables the PIDs controller in group1:
mount -t cgroup2 nodev /cgroup2
mkdir /cgroup2/group1
mkdir /cgroup2/group1/nested1
mkdir /cgroup2/group1/nested2
echo +pids > /cgroup2/cgroup.subtree_control
The following diagram illustrates the status after running this
sequence. We can see that the two PIDs controller interface files, pids.max and pids.current, were created for group1.
Now, if we run:
echo +pids > /cgroup2/group1/cgroup.subtree_control
this will enable the PIDs controller in group1’s immediate subgroups,
nested1 and nested2. Writing +pids into the subtree_control
of the root cgroup only enabled the PIDs controller in the root’s
direct child subgroups and no other descendants; it is this second write that causes
the PIDs-controller-specific files
(pids.max and pids.current) to be created for nested1 and nested2.
The next diagram shows the status after enabling the PIDs controller on group1.
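Because each write only affects the immediate subgroups, reaching a deeper group takes one write per level, from the top down. The sketch below automates that; enable_along_path is a hypothetical helper name, and the paths in the usage note are the ones from the example above.

```shell
# enable_along_path ROOT REL_PATH CONTROLLER
# Writes "+CONTROLLER" into cgroup.subtree_control of ROOT and of every
# ancestor of ROOT/REL_PATH, top-down. The final group itself is not
# written to; it is where the controller's files should then appear.
enable_along_path() {
    dir="$1"
    rest="$2"
    controller="$3"
    while [ -n "$rest" ]; do
        printf '+%s\n' "$controller" > "$dir/cgroup.subtree_control" || return 1
        dir="$dir/${rest%%/*}"          # descend one level
        case "$rest" in
            */*) rest="${rest#*/}" ;;   # drop the leading component
            *)   rest="" ;;             # that was the last component
        esac
    done
}
```

Calling `enable_along_path /cgroup2 group1/nested1 pids` performs the same two writes as the sequence above: first to /cgroup2/cgroup.subtree_control, then to /cgroup2/group1/cgroup.subtree_control.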
The no-internal-process rule
Unlike in cgroups v1, in cgroups v2 you can attach processes
only to leaves. This means that you cannot attach a process to an
internal subgroup if it has any controller enabled. The reason behind
this rule is that processes in a given subgroup competing for resources
with threads attached to its parent group create significant implementation difficulties.
The following diagram illustrates this.
(Note: if you write 0 into
cgroup.procs, this will write the PID of the process
performing the writing
into the file.)
The documentation discusses the no-internal-process rule in more detail.
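As a rough illustration of the rule, the helper below refuses up front to attach a process to a group that has controllers enabled for its children. The name attach_pid is hypothetical, and the check is a simplification: the kernel enforces the rule itself by rejecting the write, and this sketch does not special-case the root group.

```shell
# attach_pid CGROUP_DIR PID: move PID into CGROUP_DIR unless the group
# has a non-empty cgroup.subtree_control (controllers enabled for its
# children), in which case only its leaves may hold processes.
attach_pid() {
    # grep reads the file rather than testing its size, since
    # interface files in a synthetic filesystem may report size 0.
    if grep -q '[^[:space:]]' "$1/cgroup.subtree_control" 2>/dev/null; then
        echo "attach_pid: $1 has controllers enabled for its children" >&2
        return 1
    fi
    printf '%s\n' "$2" > "$1/cgroup.procs"
}
```

For example, `attach_pid /cgroup2/group1/nested1 "$$"` would move the current shell into the leaf group nested1, while the same call on /cgroup2/group1 would be refused once +pids has been written to group1’s cgroup.subtree_control.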
In cgroups v1, a process can belong to many subgroups, if
those subgroups are in different hierarchies with different
controllers attached. But, because belonging to more than one
subgroup made it difficult to disambiguate subgroup membership, in
cgroups v2, a process can belong only to a single subgroup.
Let’s look at an example where this restriction is important. In cgroups v1,
there are two network controllers: net_prio (written by Neil
Horman) and net_cls (by Thomas Graf). These controllers were
not extended to support cgroups v2. Instead, the
xt_cgroup netfilter matching module was extended to support
matching by a cgroup path. For example, the following iptables rule
matches traffic that was generated by a socket created in a process
attached to mygroup (or its descendants):
iptables -A OUTPUT -m cgroup --path mygroup -j LOG
This kind of match is not possible in cgroups v1, because a
process can sometimes belong to more than a single subgroup. In cgroups v2, this
problem does not exist, thanks to the single-subgroup rule.
Work is ongoing; in addition to the resource-group patches
mentioned earlier, there are patches for a new
RDMA cgroup controller that are currently in the pipeline. This patch set
allows resource accounting and limit enforcement on a per-cgroup,
per-RDMA-device basis. These patches are in the post-RFC phase, and
are in their ninth iteration as of this writing; it seems likely that they will
be merged soon.
As we have seen, the new interface of cgroups v2,
which was recently declared stable in the kernel, has several advantages over
cgroups v1, such as its notification-to-user-space mechanism.
Although the cgroups v2
implementation is still in its initial stages, it seems to be much
better organized and more consistent than cgroups v1.