Author: Xing Yang (VMware), Ashutosh Kumar (VMware)
Kubernetes v1.24 introduced an alpha quality implementation of improvements
for handling a non-graceful node shutdown.
In Kubernetes v1.26, this feature moves to beta. It allows stateful workloads to fail over to a different node after the original node is shut down or in a non-recoverable state, such as a hardware failure or a broken OS.
What is a node shutdown in Kubernetes?
In a Kubernetes cluster, it is possible for a node to shut down. This could happen either in a planned way or unexpectedly. You may plan a security patch or a kernel upgrade and need to reboot the node, or the node may shut down due to preemption of VM instances. A node may also shut down due to a hardware failure or a software problem.
To trigger a node shutdown, you could run a shutdown or poweroff command in a shell,
or physically press a button to power off a machine.
A node shutdown could lead to workload failure if the node is not drained before the shutdown.
In the following, we will describe what a graceful node shutdown is and what a non-graceful node shutdown is.
What is a graceful node shutdown?
The kubelet's handling of a graceful node shutdown
allows the kubelet to detect a node shutdown event, properly terminate the pods on that node,
and release resources before the actual shutdown.
Critical pods
are terminated after all the regular pods are terminated, to ensure that the
essential functions of an application can continue to work as long as possible.
What is a non-graceful node shutdown?
A Node shutdown can be graceful only if the kubelet's node shutdown manager can
detect the upcoming node shutdown action. However, there are cases where a kubelet
does not detect a node shutdown action. This could happen because the shutdown
command does not trigger the Inhibitor Locks mechanism used by the kubelet on Linux, or because of a user error. For example, if
the shutdownGracePeriod and shutdownGracePeriodCriticalPods details are not
configured correctly for that node.
When a node is shut down (or crashes), and that shutdown was not detected by the kubelet
node shutdown manager, it becomes a non-graceful node shutdown. Non-graceful node shutdown
is a problem for stateful apps.
If a node containing a pod that is part of a StatefulSet is shut down in a non-graceful way, the Pod
will be stuck in Terminating status indefinitely, and the control plane cannot create a replacement
Pod for that StatefulSet on a healthy node.
You can delete the failed Pods manually, but this is not ideal for a self-healing cluster.
Similarly, Pods that a ReplicaSet created as part of a Deployment, and
that were bound to the now-shutdown node, stay in Terminating status indefinitely.
If you have set a horizontal scaling limit, even those terminating Pods count against the limit,
so your workload may struggle to self-heal if it was already at maximum scale.
(By the way: if the node that had done a non-graceful shutdown comes back up, the kubelet does delete
the old Pod, and the control plane can make a replacement.)
What's new for the beta?
For Kubernetes v1.26, the non-graceful node shutdown feature is beta and enabled by default.
The NodeOutOfServiceVolumeDetach feature gate is enabled by default
on kube-controller-manager instead of being opt-in; you can still disable it if needed
(please also file an issue to explain the problem).
On the instrumentation side, the kube-controller-manager reports two new metrics.
force_delete_pods_total: the number of pods that are being forcibly deleted (resets on Pod garbage collection controller restart)
force_delete_pod_errors_total: the number of errors encountered when attempting forcible Pod deletion (also resets on Pod garbage collection controller restart)
How does it work?
In the case of a node shutdown, if a graceful shutdown is not working or the node is in a
non-recoverable state due to hardware failure or broken OS, you can manually add an out-of-service
taint on the Node. For example, this can be node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
or node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule. This taint triggers pods on the node to
be forcefully deleted if there are no matching tolerations on the pods. Persistent volumes attached to the shutdown node will be detached, and new pods will be created successfully on a different running node.
Note: Before applying the out-of-service taint, you must verify that a node is already in shutdown
or power-off state (not in the middle of restarting), either because the user intentionally shut it down
or the node is down due to hardware failures, OS issues, etc.
Once all the workload pods that are linked to the out-of-service node have moved to a new running node, and the shutdown node has been recovered, you should remove that taint from the affected node.
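For example, assuming a failed node named node-1 (the node name here is just a placeholder), the taint could be applied and later removed with kubectl:

kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

and, once the workloads have failed over and the node has been recovered:

kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-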
What’s next?
Depending on feedback and adoption, the Kubernetes team plans to push the Non-Graceful Node Shutdown implementation to GA in either 1.27 or 1.28.
This feature requires a user to manually add a taint to the node to trigger the failover of workloads and remove the taint after the node is recovered.
The cluster operator can automate this process by automatically applying the out-of-service taint
if there is a programmatic way to determine that the node is really shut down and there isn’t IO between
the node and storage. The cluster operator can then automatically remove the taint after the workload
fails over successfully to another running node and the shutdown node has been recovered.
In the future, we plan to find ways to automatically detect and fence nodes that are shut down or in a non-recoverable state and fail their workloads over to another node.
There are many people who have helped review the design and implementation along the way. We want to thank everyone who has contributed to this effort, including the roughly 30 people who have reviewed the KEP and implementation over the last couple of years.
This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.
Authors: Patrick Ohly (Intel), Kevin Klues (NVIDIA)
Dynamic resource allocation is a new API for requesting resources. It is a
generalization of the persistent volumes API for generic resources, making it possible to:
access the same resource instance in different pods and containers,
attach arbitrary constraints to a resource request to get the exact resource
you are looking for,
initialize a resource according to parameters provided by the user.
Third-party resource drivers are responsible for interpreting these parameters
as well as tracking and allocating resources as requests come in.
Dynamic resource allocation is an alpha feature and only enabled when the
DynamicResourceAllocation feature gate and the
resource.k8s.io/v1alpha1 API group are enabled. For details, see the
--feature-gates and --runtime-config kube-apiserver
parameters.
The kube-scheduler, kube-controller-manager and kubelet components all need
the feature gate enabled as well.
The default configuration of kube-scheduler enables the DynamicResources
plugin if and only if the feature gate is enabled. Custom configurations may
have to be modified to include it.
Once dynamic resource allocation is enabled, resource drivers can be installed
to manage certain kinds of hardware. Kubernetes has a test driver that is used
for end-to-end testing, but also can be run manually. See
below for step-by-step instructions.
API
The new resource.k8s.io/v1alpha1 API group provides four new types:
ResourceClass
Defines which resource driver handles a certain kind of
resource and provides common parameters for it. ResourceClasses
are created by a cluster administrator when installing a resource
driver.
ResourceClaim
Defines a particular resource instance that is required by a
workload. Created by a user (lifecycle managed manually, can be shared
between different Pods) or for individual Pods by the control plane based on
a ResourceClaimTemplate (automatic lifecycle, typically used by just one
Pod).
ResourceClaimTemplate
Defines the spec and some metadata for creating
ResourceClaims. Created by a user when deploying a workload.
PodScheduling
Used internally by the control plane and resource drivers
to coordinate pod scheduling when ResourceClaims need to be allocated
for a Pod.
Parameters for ResourceClass and ResourceClaim are stored in separate objects,
typically using the type defined by a CRD that was created when
installing a resource driver.
With this alpha feature enabled, the spec of a Pod defines the ResourceClaims that are needed for a Pod
to run: this information goes into a new
resourceClaims field. Entries in that list reference either a ResourceClaim
or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using
this .spec (for example, inside a Deployment or StatefulSet) share the same
ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets
its own ResourceClaim instance.
For a container defined within a Pod, the resources.claims list
defines whether that container gets
access to these resource instances, which makes it possible to share resources
between one or more containers inside the same Pod. For example, an init container could
set up the resource before the application uses it.
Here is an example of a fictional resource driver. Two ResourceClaim objects
will get created for this Pod and each container gets access to one of them.
Assuming a resource driver called resource-driver.example.com was installed
together with the following resource class:
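(what follows is a sketch: the class, template, claim, and image names are made up, while the field names follow the v1alpha1 schema)

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: example-class
driverName: resource-driver.example.com

a Pod could then request two resources of that class through a ResourceClaimTemplate:

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: example-template
spec:
  spec:
    resourceClassName: example-class
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-claims
spec:
  resourceClaims:
  - name: resource-0
    source:
      resourceClaimTemplateName: example-template
  - name: resource-1
    source:
      resourceClaimTemplateName: example-template
  containers:
  - name: first
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: resource-0
  - name: second
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: resource-1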
In contrast to native resources (such as CPU or RAM) and
extended resources
(managed by a
device plugin, advertised by kubelet), the scheduler has no knowledge of what
dynamic resources are available in a cluster or how they could be split up to
satisfy the requirements of a specific ResourceClaim. Resource drivers are
responsible for that. Drivers mark ResourceClaims as allocated once the resources
for them are reserved. This also then tells the scheduler where in the cluster a
claimed resource is actually available.
ResourceClaims can get resources allocated as soon as the ResourceClaim
is created (immediate allocation), without considering which Pods will use
the resource. The default (wait for first consumer) is to delay allocation until
a Pod that relies on the ResourceClaim becomes eligible for scheduling.
This design with two allocation options is similar to how Kubernetes handles
storage provisioning with PersistentVolumes and PersistentVolumeClaims.
In the wait for first consumer mode, the scheduler checks all ResourceClaims needed
by a Pod. If the Pod has any ResourceClaims, the scheduler creates a PodScheduling object
(a special object that requests scheduling details on behalf of the Pod). The PodScheduling object
has the same name and namespace as the Pod, and the Pod as its owner.
Using its PodScheduling, the scheduler informs the resource drivers
responsible for those ResourceClaims about nodes that the scheduler considers
suitable for the Pod. The resource drivers respond by excluding nodes that
don't have enough of the driver's resources left.
Once the scheduler has that resource
information, it selects one node and stores that choice in the PodScheduling
object. The resource drivers then allocate resources based on the relevant
ResourceClaims so that the resources will be available on that selected node.
Once that resource allocation is complete, the scheduler attempts to schedule the Pod
to a suitable node. Scheduling can still fail at this point; for example, a different Pod could
be scheduled to the same node in the meantime. If this happens, already allocated
ResourceClaims may get deallocated to enable scheduling onto a different node.
As part of this process, ResourceClaims also get reserved for the
Pod. Currently, ResourceClaims can be used either exclusively by a single Pod or
by an unlimited number of Pods.
One key feature is that Pods do not get scheduled to a node unless all of
their resources are allocated and reserved. This avoids the scenario where
a Pod gets scheduled onto one node and then cannot run there, which is bad
because such a pending Pod also blocks all other resources like RAM or CPU that were
set aside for it.
Limitations
The scheduler plugin must be involved in scheduling Pods which use
ResourceClaims. Bypassing the scheduler by setting the nodeName field leads
to Pods that the kubelet refuses to start because the ResourceClaims are not
reserved or not even allocated. It may be possible to remove this
limitation in the
future.
Writing a resource driver
A dynamic resource allocation driver typically consists of two separate-but-coordinating
components: a centralized controller, and a DaemonSet of node-local kubelet
plugins. Most of the work required by the centralized controller to coordinate
with the scheduler can be handled by boilerplate code. Only the business logic
required to actually allocate ResourceClaims against the ResourceClasses owned
by the plugin needs to be customized. As such, Kubernetes provides
the following package, including APIs for invoking this boilerplate code as
well as a Driver interface that you can implement to provide your custom
business logic:
Likewise, boilerplate code can be used to register the node-local plugin with
the kubelet, as well as start a gRPC server to implement the kubelet plugin
API. For drivers written in Go, the following package is recommended:
It is up to the driver developer to decide how these two components
communicate. The KEP outlines an approach using
CRDs.
Within SIG Node, we also plan to provide a complete example
driver that can serve
as a template for other drivers.
Running the test driver
The following steps bring up a local, one-node cluster directly from the
Kubernetes source code. As a prerequisite, your cluster must have nodes with a container
runtime that supports the
Container Device Interface
(CDI). For example, you can run CRI-O v1.23.2 or later.
Once containerd v1.7.0 is released, we expect that you can run that or any later version.
In the example below, we use CRI-O.
First, clone the Kubernetes source code. Inside that directory, run:
$ hack/install-etcd.sh
...
$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
FEATURE_GATES=DynamicResourceAllocation=true \
DNS_ADDON="coredns" \
CGROUP_DRIVER=systemd \
CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
LOG_LEVEL=6 \
ENABLE_CSI_SNAPSHOTTER=false \
API_SECURE_PORT=6444 \
ALLOW_PRIVILEGED=1 \
PATH=$(pwd)/third_party/etcd:$PATH \
./hack/local-up-cluster.sh -O
...
To start using your cluster, you can open up another terminal/tab and run:
export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
...
Once the cluster is up, in another
terminal run the test driver controller. KUBECONFIG must be set for all of
the following commands.
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller
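In another terminal, run the kubelet plugin of the test driver. Here is a sketch of that step; the subcommand name and the directory paths below are taken from the test driver's setup for a local-up-cluster environment and may differ:

$ sudo mkdir -p /var/run/cdi
$ sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 kubelet-plugin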
Changing the permissions of the directories makes it possible to run and (when
using delve) debug the kubelet plugin as a normal user, which is convenient
because it uses the already populated Go cache. Remember to restore permissions
with sudo chmod go-w when done. Alternatively, you can also build the binary
and run that as root.
Now the cluster is ready to create objects:
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created
$ kubectl get resourceclaims
NAME RESOURCECLASSNAME ALLOCATIONMODE STATE AGE
test-inline-claim-resource example WaitForFirstConsumer allocated,reserved 8s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-inline-claim 0/2 Completed 0 21s
The test driver doesn't do much; it only sets environment variables as defined
in the ConfigMap. The test pod dumps the environment, so you can check the log
to verify that everything worked:
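For example, a hypothetical check (the pod name comes from the listing above; --all-containers avoids having to know the individual container names):

$ kubectl logs test-inline-claim --all-containers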
You can view or comment on the project board
for dynamic resource allocation.
In order to move this feature towards beta, we need feedback from hardware
vendors, so here's a call to action: try out this feature, consider how it can help
with problems that your users are having, and write resource drivers…
Authors: Brandon Smith (Microsoft) and Mark Rossetti (Microsoft)
The long-awaited day has arrived: HostProcess containers, the Windows equivalent to Linux privileged
containers, have finally made it to GA in Kubernetes 1.26!
What are HostProcess containers and why are they useful?
Cluster operators are often faced with the need to configure their nodes upon provisioning such as
installing Windows services, configuring registry keys, managing TLS certificates,
making network configuration changes, or even deploying monitoring tools such as Prometheus's node-exporter.
Previously, performing these actions on Windows nodes was usually done by running PowerShell scripts
over SSH or WinRM sessions and/or working with your cloud provider's virtual machine management tooling.
HostProcess containers now enable you to do all of this and more with minimal effort using Kubernetes native APIs.
With HostProcess containers you can now package any payload
into the container image, map volumes into containers at runtime, and manage them like any other Kubernetes workload.
You get all the benefits of containerized packaging and deployment methods combined with a reduction in
both administrative and development cost.
Gone are the days where cluster operators would need to manually log onto
Windows nodes to perform administrative duties.
HostProcess containers differ
quite significantly from regular Windows Server containers.
They are run directly as processes on the host with the access policies of
a user you specify. HostProcess containers run as either the built-in Windows system accounts or
ephemeral users within a user group defined by you. HostProcess containers also share
the host's network namespace and access/configure storage mounts visible to the host.
On the other hand, Windows Server containers are highly isolated and exist in a separate
execution namespace. Direct access to the host from a Windows Server container is explicitly disallowed
by default.
How does it work?
Windows HostProcess containers are implemented with Windows Job Objects,
a break from the previous container model, which uses server silos.
Job Objects are components of the Windows OS which offer the ability to
manage a group of processes as a unit (also known as a job) and assign resource constraints to the
group as a whole. Job objects are specific to the Windows OS and are not associated with
the Kubernetes Job API. They have no process
or file system isolation,
enabling the privileged payload to view and edit the host file system with the
desired permissions, among other host resources. The init process, and any processes
it launches (including processes explicitly launched by the user) are all assigned to the
job object of that container. When the init process exits or is signaled to exit,
all the processes in the job will be signaled to exit, the job handle will be
closed and the storage will be unmounted.
HostProcess and Linux privileged containers enable similar scenarios but differ
greatly in their implementation (hence the naming difference). HostProcess containers
have their own PodSecurityContext fields.
Those used to configure Linux privileged containers do not apply. Enabling privileged access to a Windows host is a
fundamentally different process than with Linux so the configuration and
capabilities of each differ significantly. Below is a diagram detailing the
overall architecture of Windows HostProcess containers:
Two major features were added prior to moving to stable: the ability to run as local user accounts, and
a simplified method of accessing volume mounts. To learn more, read
Create a Windows HostProcess Pod.
HostProcess containers in action
Kubernetes SIG Windows has been busy putting HostProcess containers to use - even before GA!
They've been very excited to use HostProcess containers for a number of important activities
that were a pain to perform in the past.
Here are just a few of the many use cases with example deployments:
A HostProcess container can be built using any base image of your choosing; however, for convenience we have
created a HostProcess container base image.
This image is only a few KB in size and does not inherit any of the compatibility requirements of regular Windows
Server containers, which allows it to run on any Windows Server version.
To use that Microsoft image, put this in your Dockerfile:
FROM mcr.microsoft.com/oss/kubernetes/windows-host-process-containers-base-image:v1.0.0
You can run HostProcess containers from within a
HostProcess Pod.
To get started with running Windows containers,
see the general guidance for deploying Windows nodes.
If you have a compatible node (for example: Windows as the operating system
with containerd v1.7 or later as the container runtime), you can deploy a Pod with one
or more HostProcess containers.
See the Create a Windows HostProcess Pod - Prerequisites
for more information.
Please note that within a Pod, you can't mix HostProcess containers with normal Windows containers.
The Kubernetes Special Interest Group (SIG) Release is proud to announce that we
are digitally signing all release artifacts, and that this aspect of Kubernetes
has now reached beta.
Signing artifacts provides end users a chance to verify the integrity of the
downloaded resource. It makes it possible to mitigate man-in-the-middle attacks directly on
the client side and therefore ensures the trustworthiness of the remote location serving the
artifacts. The overall goal of our past work was to define the tooling used for
signing all Kubernetes-related artifacts, as well as to provide a standard signing
process for related projects (for example for those in kubernetes-sigs).
We already signed all officially released container images (from Kubernetes v1.24 onwards).
Image signing was alpha for v1.24 and v1.25. For v1.26, we've added all
binary artifacts to the signing process as well! This means that now all
client, server and source tarballs, binary artifacts,
Software Bills of Material (SBOMs) as well as the build
provenance will be signed using cosign. Technically
speaking, we now ship additional *.sig (signature) and *.cert (certificate)
files side by side to the artifacts for verifying their integrity.
To verify an artifact, for example kubectl, you can download the
signature and certificate alongside the binary. I use the release candidate
rc.1 of v1.26 for demonstration purposes, because the final release was not yet available at the time of writing:
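Here is a sketch of that verification, assuming cosign v1.x with keyless verification enabled via COSIGN_EXPERIMENTAL (the URLs follow the dl.k8s.io release layout):

curl -sSfL https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl -o kubectl
curl -sSfL https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.sig -o kubectl.sig
curl -sSfL https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.cert -o kubectl.cert
COSIGN_EXPERIMENTAL=1 cosign verify-blob kubectl --signature kubectl.sig --certificate kubectl.cert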
The HashedRekordObj.signature.content should match the content of the file
kubectl.sig and HashedRekordObj.signature.publicKey.content should be
identical to the contents of kubectl.cert. It is also possible to specify
the remote certificate and signature locations without downloading them:
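For example (same assumptions as above):

COSIGN_EXPERIMENTAL=1 cosign verify-blob kubectl \
    --signature https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.sig \
    --certificate https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.cert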
tlog entry verified with uuid: 5d54b39222e3fa9a21bcb0badd8aac939b4b0d1d9085b37f1f10b18a8cd24657 index: 8173886
Verified OK
All of the mentioned steps as well as how to verify container images are
outlined in the official documentation about how to Verify Signed Kubernetes
Artifacts. In one of the upcoming Kubernetes releases, we will work on
making the overall story more mature by ensuring that truly all
Kubernetes artifacts are signed. Besides that, we are considering using Kubernetes-owned
infrastructure for the signing (root trust) and verification (transparency
log) process.
Getting involved
If you're interested in contributing to SIG Release, then consider applying for
the upcoming v1.27 shadowing program (watch for the announcement on
k-dev) or join our weekly meeting to say hi.
Thank you for reading this blog post! I'd like to use this opportunity to give
all involved SIG Release folks a special shout-out for shipping this feature in
time!
It's with immense joy that we announce the release of Kubernetes v1.26!
This release includes a total of 37 enhancements: eleven of them are graduating to Stable, ten are
graduating to Beta, and sixteen of them are entering Alpha. We also have twelve features being
deprecated or removed, three of which we detail further in this announcement.
Release theme and logo
Kubernetes 1.26: Electrifying
The theme for Kubernetes v1.26 is Electrifying.
Each Kubernetes release is the result of the coordinated effort of dedicated volunteers, and only
made possible due to the use of a diverse and complex set of computing resources, spread out through
multiple datacenters and regions worldwide. The end result of a release - the binaries, the container
images, the documentation - is then deployed on a growing number of personal, on-premises, and
cloud computing resources.
In this release we want to recognise the importance of all these building blocks on which Kubernetes
is developed and used, while at the same time raising awareness on the importance of taking the
energy consumption footprint into account: environmental sustainability is an inescapable concern of
creators and users of any software solution, and the environmental footprint of software, like
Kubernetes, is an area which we believe will play a significant role in future releases.
As a community, we always work to make each new release process better than before (in this release,
we have started to use Projects for tracking
enhancements, for example). If v1.24
"Stargazer"had us looking upwards, to
what is possible when our community comes together, and v1.25
"Combiner"what the combined efforts of our community
are capable of, this v1.26 "Electrifying" is also dedicated to all of those whose individual
motion, integrated into the release flow, made all of this possible.
Major themes
Kubernetes v1.26 is composed of many changes, brought to you by a worldwide team of volunteers. For
this release, we have identified several major themes.
Change in container image registry
In the previous release, Kubernetes changed the container
registry, allowing the spread of the load
across multiple Cloud Providers and Regions, a change that reduced the reliance on a single entity
and provided a faster download experience for a large number of users.
This release of Kubernetes is the first that is exclusively published in the new registry.k8s.io
container image registry. In the (now legacy) k8s.gcr.io image registry, no container image tags
for v1.26 will be published, and only tags from releases before v1.26 will continue to be
updated. Refer to registry.k8s.io: faster, cheaper and Generally
Available for more information on the
motivation, advantages, and implications of this significant change.
CRI v1alpha2 removed
With the adoption of the Container Runtime Interface (CRI) and
the removal of dockershim in v1.24, the CRI is the only
supported and documented way through which Kubernetes interacts with different container
runtimes. Each kubelet negotiates which version of CRI to use with the container runtime on that
node.
In the previous release, the Kubernetes project recommended using CRI version v1, but kubelet
could still negotiate the use of CRI v1alpha2, which was deprecated.
Kubernetes v1.26 drops support for CRI v1alpha2. That
removal will result in the kubelet not
registering the node if the container runtime doesn't support CRI v1. This means that containerd
minor version 1.5 and older are not supported in Kubernetes 1.26; if you use containerd, you will
need to upgrade to containerd version 1.6.0 or later before you upgrade that node to Kubernetes
v1.26. This applies equally to any other container runtimes that only support CRI v1alpha2: if
that affects you, you should contact the container runtime vendor for advice or check their website
for additional instructions on how to move forward.
Storage improvements
Following the GA of the core Container Storage Interface (CSI)
Migration
feature in the previous release, CSI migration is an on-going effort that we've been working on for
a few releases now, and this release continues to add (and remove) features aligned with the
migration's goals, as well as other improvements to Kubernetes storage.
CSI migration for Azure File and vSphere graduated to stable
Delegate FSGroup to CSI Driver graduated to stable
This feature allows Kubernetes to supply the pod's fsGroup to the CSI driver when a volume is
mounted so that the driver can utilize
mount options to control volume permissions. Previously, the kubelet would always apply the
fsGroup ownership and permission change to files in the volume according to the policy specified in
the Pod's .spec.securityContext.fsGroupChangePolicy field. Starting with this release, CSI
drivers have the option to apply the fsGroup settings during attach or mount time of the volumes.
In-tree GlusterFS driver removal
Already deprecated in the v1.25 release, the in-tree GlusterFS driver was
removed in this release.
Signing Kubernetes release artifacts graduates to beta
Introduced in Kubernetes v1.24, this
feature constitutes a significant milestone
in improving the security of the Kubernetes release process. All release artifacts are signed
keyless using cosign, and both binary artifacts and images
can be verified.
Support for Windows privileged containers graduates to stable
Privileged container support allows containers to run with similar access to the host as processes
that run on the host directly. Support for this feature in Windows nodes, called HostProcess
containers, will now graduate to Stable,
enabling access to host resources (including network resources) from privileged containers.
Improvements to Kubernetes metrics
This release has several noteworthy improvements on metrics.
Component Health Service Level Indicators graduates to alpha
Also improving on the ability to consume Kubernetes metrics, component health Service Level
Indicators (SLIs) have graduated to
Alpha: by enabling the ComponentSLIs
feature gate, an additional metrics endpoint becomes available which allows the calculation of Service
Level Objectives (SLOs) from raw healthcheck data converted into metric format.
Feature metrics are now available
Feature metrics are now available for each Kubernetes component, making it possible to track
whether each active feature gate is enabled
by checking the component's metric endpoint for kubernetes_feature_enabled.
Dynamic Resource Allocation graduates to alpha
Dynamic Resource
Allocation
is a new feature
that puts resource scheduling in the hands of third-party developers: it offers an
alternative to the limited "countable" interface for requesting access to resources
(e.g. nvidia.com/gpu: 2), providing an API more akin to that of persistent volumes. Under the
hood, it uses the Container Device
Interface (CDI) to do
its device injection. This feature is gated behind the DynamicResourceAllocation feature gate.
CEL in Admission Control graduates to alpha
This feature introduces a v1alpha1 API for validating admission
policies, enabling extensible admission
control via Common Expression Language expressions. Currently,
custom policies are enforced via admission
webhooks,
which, while flexible, have a few drawbacks when compared to in-process policy enforcement. To use,
enable the ValidatingAdmissionPolicy feature gate and the admissionregistration.k8s.io/v1alpha1
API via --runtime-config.
Pod scheduling improvements
Kubernetes v1.26 introduces some relevant enhancements to the ability to better control scheduling
behavior.
NodeInclusionPolicyInPodTopologySpread graduates to beta
By specifying a nodeInclusionPolicy in topologySpreadConstraints, you can control whether to
take taints/tolerations into consideration
when calculating Pod Topology Spread skew.
Other Updates
Graduations to stable
This release includes a total of eleven enhancements promoted to Stable.
The complete details of the Kubernetes v1.26 release are available in our release
notes.
Availability
Kubernetes v1.26 is available for download on the Kubernetes site.
To get started with Kubernetes, check out these interactive tutorials or run local
Kubernetes clusters using containers as "nodes", with kind. You can also
easily install v1.26 using kubeadm.
Release team
Kubernetes is only possible with the support, commitment, and hard work of its community. Each
release team is made up of dedicated community volunteers who work together to build the many pieces
that make up the Kubernetes releases you rely on. This requires the specialized skills of people
from all corners of our community, from the code itself to its documentation and project management.
We would like to thank the entire release team
for the hours spent hard at work to ensure we deliver a solid Kubernetes v1.26 release for our community.
A very special thanks is in order for our Release Lead, Leonard Pahlke, for successfully steering
the entire release team throughout the entire release cycle, by making sure that we could all
contribute in the best way possible to this release through his constant support and attention to
the many and diverse details that make up the path to a successful release.
KubeCon + CloudNativeCon Europe 2023 will take place in Amsterdam, The Netherlands, from 17 – 21
April 2023! You can find more information about the conference and registration on the event
site.
CloudNativeSecurityCon North America, a two-day event designed to foster collaboration, discussion
and knowledge sharing of cloud native security projects and how to best use these to address
security challenges and opportunities, will take place in Seattle, Washington (USA), from 1-2
February 2023. See the event
page for more
information.
The CNCF announced the 2022 Community Awards
Winners:
the Community Awards recognize CNCF community members that are going above and beyond to advance
cloud native technology.
Project velocity
The CNCF K8s DevStats project
aggregates a number of interesting data points related to the velocity of Kubernetes and various
sub-projects. This includes everything from individual contributions to the number of companies that
are contributing, and is an illustration of the depth and breadth of effort that goes into evolving
this ecosystem.
Join members of the Kubernetes v1.26 release team on Tuesday January 17, 2023 10am - 11am EST (3pm - 4pm UTC) to learn about the major features
of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event
page.
Get Involved
The simplest way to get involved with Kubernetes is by joining one of the many Special Interest
Groups (SIGs) that align with your
interests.
Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly
community meeting, and through
the channels below:
Forensic container checkpointing is based on Checkpoint/Restore In
Userspace (CRIU) and allows the creation of stateful copies
of a running container without the container knowing that it is being
checkpointed. The copy of the container can be analyzed and restored in a
sandbox environment multiple times without the original container being aware
of it. Forensic container checkpointing was introduced as an alpha feature in
Kubernetes v1.25.
How does it work?
With the help of CRIU it is possible to checkpoint and restore containers.
CRIU is integrated in runc, crun, CRI-O and containerd and forensic container
checkpointing as implemented in Kubernetes uses these existing CRIU
integrations.
Why is it important?
With the help of CRIU and the corresponding integrations it is possible to get
all information and state about a running container on disk for later forensic
analysis. Forensic analysis might be important to inspect a suspicious
container without stopping or influencing it. If the container is really under
attack, the attacker might detect attempts to inspect the container. Taking a
checkpoint and analysing the container in a sandboxed environment offers the
possibility to inspect the container without the original container and maybe
attacker being aware of the inspection.
In addition to the forensic container checkpointing use case, it is also
possible to migrate a container from one node to another node without losing
the internal state. Especially for stateful containers with long initialization
times, restoring from a checkpoint might save time after a reboot or enable much
faster startup times.
How do I use container checkpointing?
The feature is behind a feature gate, so
make sure to enable the ContainerCheckpoint feature gate before you try to use
it.
The runtime must also support container checkpointing:
containerd: support is currently under discussion. See containerd
pull request #6965 for more details.
CRI-O: v1.25 has support for forensic container checkpointing.
Usage example with CRI-O
To use forensic container checkpointing in combination with CRI-O, the runtime
needs to be started with the command-line option --enable-criu-support=true.
For Kubernetes, you need to run your cluster with the ContainerCheckpoint
feature gate enabled. As the checkpointing functionality is provided by CRIU it
is also necessary to install CRIU. Usually runc or crun depend on CRIU and
therefore it is installed automatically.
It is also important to mention that, at the time of writing, the checkpointing functionality is
to be considered an alpha-level feature in CRI-O and Kubernetes, and the
security implications are still under consideration.
Once containers and pods are running it is possible to create a checkpoint.
Checkpointing
is currently only exposed on the kubelet level. To checkpoint a container,
you can run curl on the node where that container is running, and trigger a
checkpoint:
curl -X POST "https://localhost:10250/checkpoint/namespace/podId/container"
For a container named counter in a pod named counters in a namespace named
default the kubelet API endpoint is reachable at:
curl -X POST "https://localhost:10250/checkpoint/default/counters/counter"
For completeness, the following curl command-line options are necessary to
have curl accept the kubelet's self-signed certificate and authorize the
use of the kubelet checkpoint API:
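For a cluster brought up with local-up-cluster.sh, for example, those options could look like this (the certificate paths are an assumption based on that setup):

--insecure --cert /var/run/kubernetes/client-admin.crt --key /var/run/kubernetes/client-admin.key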
Triggering this kubelet API will request the creation of a checkpoint from
CRI-O. CRI-O requests a checkpoint from your low-level runtime (for example,
runc). Seeing that request, runc invokes the criu tool
to do the actual checkpointing.
Once the checkpointing has finished the checkpoint should be available at
/var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar
You could then use that tar archive to restore the container somewhere else.
Restore a checkpointed container outside of Kubernetes (with CRI-O)
With the checkpoint tar archive it is possible to restore the container outside
of Kubernetes in a sandboxed instance of CRI-O. For better user experience
during restore, I recommend that you use the latest version of CRI-O from the
main CRI-O GitHub branch. If you're using CRI-O v1.25, you'll need to
manually create certain directories Kubernetes would create before starting the
container.
The first step to restore a container outside of Kubernetes is to create a pod sandbox
using crictl:
crictl runp pod-config.json
Then you can restore the previously checkpointed container into the newly created pod sandbox:
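With crictl, that could look like this (the pod ID comes from the output of crictl runp above):

crictl create <POD_ID> container-config.json pod-config.json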
Instead of specifying a container image in a registry in container-config.json,
you need to specify the path to the checkpoint archive that you created earlier:
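A minimal sketch of such a container-config.json (the container name and the archive file name are placeholders):

{
  "metadata": {
    "name": "counter"
  },
  "image": {
    "image": "/var/lib/kubelet/checkpoints/<checkpoint-archive>.tar"
  }
}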
Next, run crictl start <CONTAINER_ID> to start that container, and then a
copy of the previously checkpointed container should be running.
Restore a checkpointed container within Kubernetes
To restore the previously checkpointed container directly in Kubernetes it is
necessary to convert the checkpoint archive into an image that can be pushed to
a registry.
One possible way to convert the local checkpoint archive consists of the
following steps with the help of buildah:
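Here is a sketch of those steps; the annotation key is the one CRI-O's checkpoint support looks for, and the archive name placeholders must match your checkpoint:

newcontainer=$(buildah from scratch)
buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar /
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container-name> $newcontainer
buildah commit $newcontainer checkpoint-image:latest
buildah rm $newcontainer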
The resulting image is not standardized and only works in combination with
CRI-O. Please consider this image format as pre-alpha. There are ongoing
discussions to standardize the format of checkpoint
images like this. It is important to remember that this not-yet-standardized image
format only works if CRI-O has been started with --enable-criu-support=true.
The security implications of starting CRI-O with CRIU support are not yet clear
and therefore the functionality as well as the image format should be used with
care.
Now, you'll need to push that image to a container image registry. For example:
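A sketch with buildah, using the registry placeholder from the next paragraph:

buildah push localhost/checkpoint-image:latest container-image-registry.example/user/checkpoint-image:latest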
To restore this checkpoint image (container-image-registry.example/user/checkpoint-image:latest), the
image needs to be listed in the specification for a Pod. Here's an example
manifest:
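The following is a sketch of such a manifest (the Pod and container names are made up):

apiVersion: v1
kind: Pod
metadata:
  name: example-restore
spec:
  containers:
  - name: counter
    image: container-image-registry.example/user/checkpoint-image:latest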
Kubernetes schedules the new Pod onto a node. The kubelet on that node
instructs the container runtime (CRI-O in this example) to create and start a
container based on the image specified as container-image-registry.example/user/checkpoint-image:latest.
CRI-O detects that this image
is a reference to checkpoint data rather than a container image. Then,
instead of the usual steps to create and start a container,
CRI-O fetches the checkpoint data and restores the container from that
specified checkpoint.
The application in that Pod would continue running as if the checkpoint had not been taken;
within the container, the application looks and behaves like any other container that had been
started normally and not restored from a checkpoint.
With these steps, it is possible to replace a Pod running on one node
with a new equivalent Pod that is running on a different node,
and without losing the state of the containers in that Pod.
Debugging software in production is one of the biggest challenges we have to
face in our containerized environments. Being able to understand the impact of
the available security options, especially when it comes to configuring our
deployments, is one of the key aspects to make the default security in
Kubernetes stronger. We have all those logging, tracing and metrics data already
at hand, but how do we assemble the information they provide into something
human readable and actionable?
Seccomp is one of the standard mechanisms to protect a Linux based
Kubernetes application from malicious actions by interfering with its system
calls. This allows us to restrict the application to a defined set of
actionable items, like modifying files or responding to HTTP requests. Linking
the knowledge of which set of syscalls is required to, for example, modify a
local file, to the actual source code is similarly non-trivial. Seccomp
profiles for Kubernetes have to be written in JSON and can be understood
as an architecture specific allow-list with superpowers, for example:
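The following is a minimal sketch of such a profile; the syscall list is purely illustrative:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["open", "openat", "read", "write", "close", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}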
The above profile errors by default by specifying SCMP_ACT_ERRNO as the
defaultAction. This means we have to allow a set of syscalls via
SCMP_ACT_ALLOW; otherwise the application would not be able to do anything at
all. Okay cool, so to allow file operations, all we have to do is
add a bunch of file-specific syscalls like open or write, and probably
also allow changing permissions via chmod and chown, right?
Basically yes, but there are issues with the simplicity of that approach:
Seccomp profiles need to include the minimum set of syscalls required to start
the application. This also includes some syscalls from the lower level
Open Container Initiative (OCI) container runtime, for example
runc or crun. Besides that, we can only guarantee the required
syscalls for a very specific version of the runtimes and our application,
because the code parts can change between releases. The same applies to the
termination of the application as well as the target architecture we're
deploying on. Features like executing commands within containers also require
another subset of syscalls. Not to mention that there are multiple versions of
syscalls doing slightly different things, and that seccomp profiles are able to
modify their arguments. It's also not always clearly visible to developers
which syscalls are used by the code they have written themselves, because they rely on
programming language abstractions or frameworks.
How can we know which syscalls are even required then? Who should create and
maintain those profiles during its development life-cycle?
Well, recording and distributing seccomp profiles is one of the problem domains
of the Security Profiles Operator, which is already solving that. The
operator is able to record seccomp, SELinux and even
AppArmor profiles into a Custom Resource Definition (CRD),
reconciles them to each node and makes them available for usage.
The biggest challenge about creating security profiles is to catch all code
paths which execute syscalls. We could achieve that by having 100% logical
coverage of the application when running an end-to-end test suite. You see the
problem with the previous statement: it's too idealistic ever to be fulfilled,
even without taking all the moving parts during application development and
deployment into account.
Missing a syscall in the seccomp profiles' allow list can have tremendously
negative impact on the application. It's not only that we can encounter crashes,
which are trivially detectable. It can also happen that they slightly change
logical paths, change the business logic, make parts of the application
unusable, slow down performance or even expose security vulnerabilities. We're
simply not able to see the whole impact of that, especially because blocked
syscalls via SCMP_ACT_ERRNO do not provide any additional audit
logging on the system.
Does that mean we're lost? Is it just not realistic to dream about a Kubernetes
where everyone uses the default seccomp profile? Should we
stop striving towards maximum security in Kubernetes and accept that it's not
meant to be secure by default?
Definitely not. Technology evolves over time and there are many folks
working behind the scenes of Kubernetes to indirectly deliver features to
address such problems. One of the mentioned features is the seccomp notifier,
which can be used to find suspicious syscalls in Kubernetes.
The seccomp notify feature consists of a set of changes introduced in Linux 5.9.
It makes the kernel capable of communicating seccomp related events to the user
space. That allows applications to act based on the syscalls and opens for a
wide range of possible use cases. We not only need the right kernel version,
but also at least runc v1.1.0 (or crun v0.19) to be able to make the notifier
work at all. The Kubernetes container runtime CRI-O gets support for
the seccomp notifier in v1.26.0. The new feature allows us to
identify possibly malicious syscalls in our application, and therefore makes it
possible to verify profiles for consistency and completeness. Let's give that a
try.
First of all, we need to run the latest main version of CRI-O, because v1.26.0
had not been released yet at the time of writing. You can do that by either
compiling it from the source code or by using the pre-built binary
bundle via the get-script. The seccomp notifier feature of CRI-O is
guarded by an annotation, which has to be explicitly allowed, for example by
using a configuration drop-in like this:
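A sketch of such a drop-in (the file name is arbitrary; runc is assumed as the default runtime):

cat /etc/crio/crio.conf.d/02-runtimes.conf
[crio.runtime]
default_runtime = "runc"

[crio.runtime.runtimes.runc]
allowed_annotations = ["io.kubernetes.cri-o.seccompNotifierAction"]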
If CRI-O is up and running, then it should indicate that the seccomp notifier is
available as well:
> sudo ./bin/crio --enable-metrics
…
INFO[…] Starting seccomp notifier watcher
INFO[…] Serving metrics on :9090 via HTTP
…
We also enable the metrics, because they provide additional telemetry data about
the notifier. Now we need a running Kubernetes cluster for demonstration
purposes. For this demo, we mainly stick to the
hack/local-up-cluster.sh approach to locally spawn a single node
Kubernetes cluster.
If everything is up and running, then we would have to define a seccomp profile
for testing purposes. But we do not have to create our own; we can just use the
RuntimeDefault profile which gets shipped with each container runtime. For
example the RuntimeDefault profile for CRI-O can be found in the
containers/common library.
Now we need a test container, which can be a simple nginx pod like
this:
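Here is a sketch of such a pod; the image tag is an assumption, while the annotation and restartPolicy are explained below:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    io.kubernetes.cri-o.seccompNotifierAction: "stop"
spec:
  restartPolicy: Never
  containers:
  - name: nginx
    image: nginx:1.23
    ports:
    - containerPort: 80
    securityContext:
      seccompProfile:
        type: RuntimeDefault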
Please note the annotation io.kubernetes.cri-o.seccompNotifierAction, which
enables the seccomp notifier for this workload. The value of the annotation can
be either stop, for stopping the workload, or anything else, for doing nothing
other than logging and emitting metrics. Because of the termination, we also use
restartPolicy: Never to avoid automatically recreating the container on
failure.
Let's run the pod and check if it works:
> kubectl apply -f nginx.yaml
> kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 3m39s 10.85.0.3 127.0.0.1 <none> <none>
We can also test if the web server itself works as intended:
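For example, by querying the pod IP from the output above (this should return the default nginx welcome page):

> curl -s 10.85.0.3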
While everything is now up and running, CRI-O also indicates that it has started
the seccomp notifier:
…
INFO[…] Injecting seccomp notifier into seccomp profile of container 662a3bb0fdc7dd1bf5a88a8aa8ef9eba6296b593146d988b4a9b85822422febb
…
If we now run a forbidden syscall inside the container, then we can
expect the workload to be terminated. Let's give that a try by running
chroot in the container's namespaces:
> kubectl exec -it nginx -- bash
root@nginx:/# chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
root@nginx:/# command terminated with exit code 137
The exec session got terminated, so it looks like the container is not running
any more:
> kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 seccomp killed 0 96s
Alright, the container got killed by seccomp, do we get any more information
about what was going on?
> kubectl describe pod nginx
Name: nginx
…
Containers:
nginx:
…
State: Terminated
Reason: seccomp killed
Message: Used forbidden syscalls: chroot (1x)
Exit Code: 137
Started: Mon, 14 Nov 2022 12:19:46 +0100
Finished: Mon, 14 Nov 2022 12:20:26 +0100
…
The seccomp notifier feature of CRI-O correctly set the termination reason and
message, including which forbidden syscall has been used how often (1x). How
often? Yes, the notifier gives the application up to 5 seconds after the last
seen syscall until it starts the termination. This means that it's possible to
catch multiple forbidden syscalls within one test, avoiding time-consuming
trial and error.
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl describe pod nginx | grep Message
Message: Used forbidden syscalls: chroot (2x), swapoff (2x)
The CRI-O metrics will also reflect that:
> curl -sf localhost:9090/metrics | grep seccomp_notifier
# HELP container_runtime_crio_containers_seccomp_notifier_count_total Amount of containers stopped because they used a forbidden syscalls by their name
# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1
How does it work in detail? CRI-O uses the chosen seccomp profile and injects
the action SCMP_ACT_NOTIFY instead of SCMP_ACT_ERRNO, SCMP_ACT_KILL,
SCMP_ACT_KILL_PROCESS or SCMP_ACT_KILL_THREAD. It also sets a local listener
path which will be used by the lower level OCI runtime (runc or crun) to create
the seccomp notifier socket. If the connection between the socket and CRI-O has
been established, then CRI-O will receive notifications for each syscall being
intercepted by seccomp. CRI-O stores the syscalls, allows a short timeout for
further ones to arrive, and then terminates the container if the chosen
seccompNotifierAction=stop. Unfortunately, the seccomp notifier is not able to
notify on the defaultAction, which means that it's required to have
a list of syscalls to test for custom profiles. CRI-O does also state that
limitation in the logs:
INFO[…] The seccomp profile default action SCMP_ACT_ERRNO cannot be overridden to SCMP_ACT_NOTIFY,
which means that syscalls using that default action can't be traced by the notifier
As a conclusion, the seccomp notifier implementation in CRI-O can be used to
verify if your applications behave correctly when using RuntimeDefault or any
other custom profile. Alerts can be created based on the metrics to create long
running test scenarios around that feature. Making seccomp understandable and
easier to use will increase adoption as well as help us to move towards a more
secure Kubernetes by default!
Thank you for reading this blog post. If you'd like to read more about the
seccomp notifier, check out the following resources:
When speaking about observability in the cloud native space, probably
everyone will mention OpenTelemetry (OTEL) at some point in the
conversation. That's great, because the community needs standards to rely on
for developing all cluster components in the same direction. OpenTelemetry
enables us to combine logs, metrics, traces and other contextual information
(called baggage) into a single resource. Cluster administrators or software
engineers can use this resource to get a viewport about what is going on in the
cluster over a defined period of time. But how can Kubernetes itself make use of
this technology stack?
Kubernetes consists of multiple components where some are independent and others
are stacked together. Looking at the architecture from a container runtime
perspective, there are, from the top to the bottom:
kube-apiserver: Validates and configures data for the API objects
kubelet: Agent running on each node
CRI runtime: Container Runtime Interface (CRI) compatible container runtime
like CRI-O or containerd
Linux kernel or Microsoft Windows: Underlying operating system
That means if we encounter a problem with running containers in Kubernetes, then
we start looking at one of those components. Finding the root cause for problems
is one of the most time consuming actions we face with the increased
architectural complexity from today's cluster setups. Even if we know the
component which seems to cause the issue, we still have to take the others into
account to maintain a mental timeline of events which are going on. How do we
achieve that? Well, most folks will probably stick to scraping logs, filtering
them and assembling them together over the components borders. We also have
metrics, right? Correct, but bringing metrics values in correlation with plain
logs makes it even harder to track what is going on. Some metrics are also not
made for debugging purposes. They have been defined based on the end user
perspective of the cluster for linking usable alerts and not for developers
debugging a cluster setup.
OpenTelemetry to the rescue: the project aims to combine signals such as
traces, metrics and logs together to maintain the
right viewport on the cluster state.
What is the current state of OpenTelemetry tracing in Kubernetes? From an API
server perspective, we have alpha support for tracing since Kubernetes v1.22,
which will graduate to beta in one of the upcoming releases. Unfortunately the
beta graduation has missed the v1.26 Kubernetes release. The design proposal can
be found in the API Server Tracing Kubernetes Enhancement Proposal
(KEP) which provides more information about it.
The kubelet tracing part is tracked in another KEP, which was
implemented in an alpha state in Kubernetes v1.25. A beta graduation is not
planned at the time of writing, but more may come in the v1.27 release cycle.
There are other side-efforts going on beside both KEPs, for example klog is
considering OTEL support, which would boost the observability by
linking log messages to existing traces. Within SIG Instrumentation and SIG Node,
we're also discussing how to link the
kubelet traces together, because right now they're focused on the
gRPC calls between the kubelet and the CRI container runtime.
CRI-O has featured OpenTelemetry tracing support since v1.23.0 and is
continuously working on improving it, for example by attaching the logs to the
traces or extending the spans to logical parts of the
application. This helps users of the traces to gain the same
information as they would from parsing the logs, but with enhanced capabilities
for scoping and filtering on other OTEL signals. The CRI-O maintainers are also working on a
container monitoring replacement for conmon, which is called
conmon-rs and is purely written in Rust. One benefit of
having a Rust implementation is to be able to add features like OpenTelemetry
support, because the crates (libraries) for those already exist. This allows a
tight integration with CRI-O and lets consumers see the lowest-level tracing
data from their containers.
The containerd folks added tracing support in v1.6.0, where it is
available by using a plugin. Lower level OCI runtimes like
runc or crun feature no support for OTEL at all, and there does not
seem to be a plan for that. We always have to consider that there is a
performance overhead when collecting the traces as well as exporting them to a
data sink. I still think it would be worth evaluating how extended
telemetry collection could look in OCI runtimes. Let's see if the Rust OCI
runtime youki considers something like that in the future.
I'll show you how to give it a try. For my demo I'll stick to a stack with a single local node
that has runc, conmon-rs, CRI-O, and a kubelet. To enable tracing in the kubelet, I need to
apply the following KubeletConfiguration:
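A minimal sketch of such a configuration, assuming the KubeletTracing feature gate and the kubelet's tracing stanza (the collector endpoint defaults to localhost:4317 if unset):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  samplingRatePerMillion: 1000000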
A samplingRatePerMillion value equal to one million will internally translate to
sampling everything. A similar configuration has to be applied to CRI-O; I can
either start the crio binary with --enable-tracing and
--tracing-sampling-rate-per-million 1000000, or use a drop-in configuration
like this:
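A sketch of such a drop-in (the file name is arbitrary; the option names assume CRI-O's [crio.tracing] configuration section):
cat /etc/crio/crio.conf.d/99-tracing.conf
[crio.tracing]
enable_tracing = true
tracing_sampling_rate_per_million = 1000000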
To configure CRI-O to use conmon-rs, you require at least the latest CRI-O
v1.25.x and conmon-rs v0.4.0. Then a configuration drop-in like this can be used
to make CRI-O use conmon-rs:
cat /etc/crio/crio.conf.d/99-runtimes.conf
[crio.runtime]
default_runtime = "runc"
[crio.runtime.runtimes.runc]
runtime_type = "pod"
monitor_path = "/path/to/conmonrs" # or will be looked up in $PATH
That's it, the default configuration will point to an OpenTelemetry
collector gRPC endpoint of localhost:4317, which has to be up and
running as well. There are multiple ways to run an OTLP collector as described in the
docs, but it's also possible to kubectl proxy into an existing
instance running within Kubernetes.
If everything is set up, then the collector should log that there are incoming
traces:
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope go.opentelemetry.io/otel/sdk/tracer
Span #0
Trace ID : 71896e69f7d337730dfedb6356e74f01
Parent ID : a2a7714534c017e6
ID : 1d27dbaf38b9da8b
Name : github.com/cri-o/cri-o/server.(*Server).filterSandboxList
Kind : SPAN_KIND_INTERNAL
Start time : 2022-11-15 09:50:20.060325562 +0000 UTC
End time : 2022-11-15 09:50:20.060326291 +0000 UTC
Status code : STATUS_CODE_UNSET
Status message :
Span #1
Trace ID : 71896e69f7d337730dfedb6356e74f01
Parent ID : a837a005d4389579
ID : a2a7714534c017e6
Name : github.com/cri-o/cri-o/server.(*Server).ListPodSandbox
Kind : SPAN_KIND_INTERNAL
Start time : 2022-11-15 09:50:20.060321973 +0000 UTC
End time : 2022-11-15 09:50:20.060330602 +0000 UTC
Status code : STATUS_CODE_UNSET
Status message :
Span #2
Trace ID : fae6742709d51a9b6606b6cb9f381b96
Parent ID : 3755d12b32610516
ID : 0492afd26519b4b0
Name : github.com/cri-o/cri-o/server.(*Server).filterContainerList
Kind : SPAN_KIND_INTERNAL
Start time : 2022-11-15 09:50:20.0607746 +0000 UTC
End time : 2022-11-15 09:50:20.060795505 +0000 UTC
Status code : STATUS_CODE_UNSET
Status message :
Events:
SpanEvent #0
-> Name: log
-> Timestamp: 2022-11-15 09:50:20.060778668 +0000 UTC
-> DroppedAttributesCount: 0
-> Attributes::
-> id: Str(adf791e5-2eb8-4425-b092-f217923fef93)
-> log.message: Str(No filters were applied, returning full container list)
-> log.severity: Str(DEBUG)
-> name: Str(/runtime.v1.RuntimeService/ListContainers)
I can see that the spans have a trace ID and typically have a parent attached.
Events such as logs are part of the output as well. In the above case, the kubelet is
periodically triggering a ListPodSandbox RPC to CRI-O caused by the Pod
Lifecycle Event Generator (PLEG). Displaying those traces can be done via,
for example, Jaeger. When running the tracing stack locally, a Jaeger
instance should be exposed on http://localhost:16686 by default.
The ListPodSandbox requests are directly visible within the Jaeger UI.
That's not too exciting, so I'll run a workload directly via kubectl:
kubectl run -it --rm --restart=Never --image=alpine alpine -- echo hi
hi
pod "alpine" deleted
Looking now at Jaeger, we can see that we have traces for conmonrs, crio as
well as the kubelet for the RunPodSandbox and CreateContainer CRI RPCs.
The kubelet and CRI-O spans are connected to each other to make investigation
easier. If we now take a closer look at the spans, we can see that CRI-O's
logs are correctly associated with the corresponding functionality. For example,
we can extract the container user from the traces.
The lower level spans of conmon-rs are also part of this trace. For example
conmon-rs maintains an internal read_loop for handling IO between the
container and the end user. The logs for reading and writing bytes are part of
the span. The same applies to the wait_for_exit_code span, which tells us that
the container exited successfully with code 0.
Having all that information at hand, alongside the filtering capabilities
of Jaeger, makes the whole stack a great solution for debugging container issues!
Mentioning the "whole stack" also shows the biggest downside of the overall
approach: Compared to parsing logs it adds a noticeable overhead on top of the
cluster setup. Users have to maintain a sink like Elasticsearch to
persist the data, expose the Jaeger UI and possibly take the performance
drawback into account. Anyway, it's still one of the best ways to increase the
observability aspect of Kubernetes.
Thank you for reading this blog post, I'm pretty sure we're looking into a
bright future for OpenTelemetry support in Kubernetes to make troubleshooting
simpler.
Authors: Adolfo García Veytia (Chainguard), Bob Killen (Google)
Starting with Kubernetes 1.25, our container image registry has changed from k8s.gcr.io to registry.k8s.io. This new registry spreads the load across multiple Cloud Providers & Regions, functioning as a sort of content delivery network (CDN) for Kubernetes container images. This change reduces the project’s reliance on a single entity and provides a faster download experience for a large number of users.
TL;DR: What you need to know about this change
Container images for Kubernetes releases from 1.25 onward are no longer published to k8s.gcr.io, only to registry.k8s.io.
In the upcoming December patch releases, the new registry domain default will be backported to all branches still in support (1.22, 1.23, 1.24).
If you run in a restricted environment and apply strict domain/IP address access policies limited to k8s.gcr.io, the image pulls will not function after the migration to this new registry. For these users, the recommended method is to mirror the release images to a private registry.
If you’d like to know more about why we made this change, or some potential issues you might run into, keep reading.
Why has Kubernetes changed to a different image registry?
k8s.gcr.io is hosted on a custom Google Container Registry (GCR) domain that was set up solely for the Kubernetes project. This has worked well since the inception of the project, and we thank Google for providing these resources, but today there are other cloud providers and vendors that would like to host images to provide a better experience for the people on their platforms. In addition to Google’s renewed commitment to donate $3 million to support the project's infrastructure, Amazon announced a matching donation during their KubeCon NA 2022 keynote in Detroit. This will provide a better experience for users (closer servers = faster downloads) and will reduce the egress bandwidth and costs from GCR at the same time. registry.k8s.io will spread the load between Google and Amazon, with other providers to follow in the future.
Why isn’t there a stable list of domains/IPs? Why can’t I restrict image pulls?
registry.k8s.io is a secure blob redirector that connects clients to the closest cloud provider. The nature of this change means that a client pulling an image could be redirected to any one of a large number of backends. We expect the set of backends to keep changing and will only increase as more and more cloud providers and vendors come on board to help mirror the release images.
Restrictive control mechanisms like man-in-the-middle proxies or network policies that restrict access to a specific list of IPs/domains will break with this change. For these scenarios, we encourage you to mirror the release images to a local registry that you have strict control over.
What kind of errors will I see? How will I know if I’m still using the old address?
Errors may depend on what kind of container runtime you are using, and what endpoint you are routed to, but it should present as a container failing to be created with the warning FailedCreatePodSandBox.
Below is an example error message showing a proxied deployment failing to pull due to an unknown certificate:
FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: Head “https://us-west1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8”: x509: certificate signed by unknown authority
I’m impacted by this change, how do I revert to the old registry address?
If using the new registry domain name is not an option, you can revert to the old domain name for cluster versions less than 1.25. Keep in mind that, eventually, you will have to switch to the new registry, as new image tags will no longer be pushed to GCR.
Reverting the registry name in kubeadm
The registry used by kubeadm to pull its images can be controlled by two methods:
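Setting the --image-repository flag, for example: kubeadm init --image-repository=k8s.gcr.io
Setting the imageRepository field in the kubeadm ClusterConfiguration, sketched below (the v1beta3 API version is assumed; adjust it to match your kubeadm release):
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
imageRepository: "k8s.gcr.io"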
Change is hard, and evolving our image-serving platform is needed to ensure a sustainable future for the project. We strive to make things better for everyone using Kubernetes. Many contributors from all corners of our community have been working long and hard to ensure we are making the best decisions possible, executing plans, and doing our best to communicate those plans.
Thanks to Aaron Crickenberger, Arnaud Meukam, Benjamin Elder, Caleb Woodbine, Davanum Srinivas, Mahamed Ali, and Tim Hockin from SIG K8s Infra, Brian McQueen, and Sergey Kanzhelev from SIG Node, Lubomir Ivanov from SIG Cluster Lifecycle, Adolfo García Veytia, Jeremy Rickard, Sascha Grunert, and Stephen Augustus from SIG Release, Bob Killen and Kaslin Fields from SIG Contribex, Tim Allclair from the Security Response Committee. Also a big thank you to our friends acting as liaisons with our cloud provider partners: Jay Pipes from Amazon and Jon Johnson Jr. from Google.
Change is an integral part of the Kubernetes life-cycle: as Kubernetes grows and matures, features may be deprecated, removed, or replaced with improvements for the health of the project. For Kubernetes v1.26 there are several planned deprecations and removals; this article identifies and describes some of them, based on the information available at this mid-cycle point in the v1.26 release process, which is still ongoing and may introduce additional changes.
The Kubernetes API Removal and Deprecation process
The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API is one that has been marked for removal in a future Kubernetes release; it will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.
Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.
Beta or pre-release API versions must be supported for 3 releases after deprecation.
Alpha or experimental API versions may be removed in any release without prior deprecation notice.
Whether an API is removed as a result of a feature graduating from beta to stable or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the documentation.
A note about the removal of the CRI v1alpha2 API and containerd 1.5 support
Following the adoption of the Container Runtime Interface (CRI) and the removal of dockershim in v1.24, the CRI is the supported and documented way through which Kubernetes interacts with different container runtimes. Each kubelet negotiates which version of CRI to use with the container runtime on that node.
The Kubernetes project recommends using CRI version v1; in Kubernetes v1.25 the kubelet can also negotiate the use of CRI v1alpha2 (which was deprecated at the same time as support for the stable v1 interface was added).
Kubernetes v1.26 will not support CRI v1alpha2. That removal will result in the kubelet not registering the node if the container runtime doesn't support CRI v1. This means that containerd minor version 1.5 and older will not be supported in Kubernetes 1.26; if you use containerd, you will need to upgrade to containerd version 1.6.0 or later before you upgrade that node to Kubernetes v1.26. Other container runtimes that only support v1alpha2 are equally affected: if that affects you, you should contact the container runtime vendor for advice or check their website for additional instructions on how to move forward.
If you want to benefit from v1.26 features and still use an older container runtime, you can run an older kubelet. The supported skew for the kubelet allows you to run a v1.25 kubelet, which is still compatible with v1alpha2 CRI support, even if you upgrade the control plane to the 1.26 minor release of Kubernetes.
As well as container runtimes themselves, there are tools like stargz-snapshotter that act as a proxy between the kubelet and the container runtime; those might also be affected.
Deprecations and removals in Kubernetes v1.26
In addition to the above, Kubernetes v1.26 is targeted to include several additional removals and deprecations.
Removal of the v1beta1 flow control API group
The flowcontrol.apiserver.k8s.io/v1beta1 API version of FlowSchema and PriorityLevelConfiguration will no longer be served in v1.26. Users should migrate manifests and API clients to use the flowcontrol.apiserver.k8s.io/v1beta2 API version, available since v1.23.
Removal of the v2beta2 HorizontalPodAutoscaler API
The autoscaling/v2beta2 API version of HorizontalPodAutoscaler will no longer be served in v1.26. Users should migrate manifests and API clients to use the autoscaling/v2 API version, available since v1.23.
Removal of in-tree credential management code
In this upcoming release, legacy vendor-specific authentication code that is part of Kubernetes
will be removed from both
client-go and kubectl.
The existing mechanism supports authentication for two specific cloud providers:
Azure and Google Cloud.
In its place, Kubernetes already offers a vendor-neutral
authentication plugin mechanism -
you can switch over right now, before the v1.26 release happens.
If you're affected, you can find additional guidance on how to proceed for
Azure and for
Google Cloud.
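As a sketch of what that mechanism looks like in a kubeconfig, a user entry delegates credential acquisition to an external command via the exec plugin API (kubelogin is Azure's plugin; the arguments depend on your provider and are deliberately omitted here):
users:
- name: example-user
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubelogin
      # provider-specific arguments go here; see your provider's documentation
      interactiveMode: IfAvailable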
Removal of kube-proxy userspace modes
The userspace proxy mode, deprecated for over a year, is no longer supported on either Linux or Windows and will be removed in this release. Users should use iptables or ipvs on Linux, or kernelspace on Windows: using --mode userspace will now fail.
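If you need to pick a supported mode explicitly, a minimal kube-proxy configuration file could look like this (a sketch; in the configuration API the equivalent field is called mode):
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"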
Removal of in-tree OpenStack cloud provider
Kubernetes is switching from in-tree code for storage integrations, in favor of the Container Storage Interface (CSI).
As part of this, Kubernetes v1.26 will remove the deprecated in-tree storage integration for OpenStack
(the cinder volume type). You should migrate to the external cloud provider and CSI driver from
https://github.com/kubernetes/cloud-provider-openstack instead.
For more information, visit Cinder in-tree to CSI driver migration.
Removal of the GlusterFS in-tree driver
The in-tree GlusterFS driver was deprecated in v1.25, and will be removed from Kubernetes v1.26.
Deprecation of non-inclusive kubectl flag
As part of the implementation effort of the Inclusive Naming Initiative,
the --prune-whitelist flag will be deprecated, and replaced with --prune-allowlist.
Users that use this flag are strongly advised to make the necessary changes prior to the final removal of the flag, in a future release.
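For example, once the replacement flag is available, a typical pruning invocation could look like this (a sketch; note that --prune also requires a label selector or --all):
kubectl apply -f manifests/ -l app=example --prune --prune-allowlist=core/v1/ConfigMap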
Removal of dynamic kubelet configuration
Dynamic kubelet configuration allowed new kubelet configurations to be rolled out via the Kubernetes API, even in a live cluster.
A cluster operator could reconfigure the kubelet on a Node by specifying a ConfigMap
that contained the configuration data that the kubelet should use.
Dynamic kubelet configuration was removed from the kubelet in v1.24, and will be
removed from the API server in the v1.26 release.
Deprecations for kube-apiserver command line arguments
The --master-service-namespace command line argument to the kube-apiserver doesn't have
any effect, and was already informally deprecated.
That command line argument will be formally marked as deprecated in v1.26, preparing for its
removal in a future release.
The Kubernetes project does not expect any impact from this deprecation and removal.
Deprecations for kubectl run command line arguments
Several unused option arguments for the kubectl run subcommand will be marked as deprecated, including:
--cascade
--filename
--force
--grace-period
--kustomize
--recursive
--timeout
--wait
These arguments are already ignored so no impact is expected: the explicit deprecation sets a warning message and prepares the removal of the arguments in a future release.
Removal of legacy command line arguments relating to logging
This blog post was inspired by a previous Kubernetes blog post about
Advanced Server Side Apply.
The author of said blog post listed multiple benefits for applications and
controllers when switching to server-side apply (from now on abbreviated with
SSA). Especially the chapter about
CI/CD systems
motivated me to respond and write down my thoughts and experiences.
These thoughts and experiences are the results of me working on Kluctl
for the past 2 years. I describe Kluctl as "The missing glue to put together
large Kubernetes deployments, composed of multiple smaller parts
(Helm/Kustomize/...) in a manageable and unified way."
To get a basic understanding of Kluctl, I suggest visiting the kluctl.io
website and read through the documentation and tutorials, for example the
microservices demo tutorial.
As an alternative, you can watch Hands-on Introduction to kluctl
from the Rawkode Academy YouTube channel which shows a hands-on demo session.
One of the main philosophies that Kluctl follows is "live and let live",
meaning that it will try its best to work in conjunction with any other tool or
controller running outside or inside your clusters. Kluctl will not overwrite
any fields that it lost ownership of, unless you explicitly tell it to do so.
Achieving this would not have been possible (or at least several magnitudes
harder) without the use of SSA. Server-side apply allows Kluctl
to detect when ownership for a field got lost, for example when another controller
or operator updates that field to another value. Kluctl can then decide on a
field-by-field basis whether force-applying is required, and retry based on
these decisions.
The days before SSA
The first versions of Kluctl were based on shelling out to kubectl and thus
implicitly relied on client-side apply. At that time, SSA was
still alpha and quite buggy. And to be honest, I didn't even know it was a
thing at that time.
The way client-side apply worked had some serious drawbacks. The most obvious one
(it was guaranteed that you'd stumble on this by yourself if enough time passed)
is that it relied on an annotation (kubectl.kubernetes.io/last-applied-configuration)
being added to the object, bringing in all the limitations and issues with huge
annotation values. A good example of such issues are
CRDs being so large,
that they don't fit into the annotation's value anymore.
Another drawback can be seen just by looking at the name (client-side apply).
Being client side means that each client has to provide the apply-logic on
its own, which at that time was only properly implemented inside kubectl,
making it hard to be replicated inside controllers.
This added kubectl as a dependency (either as an executable or in the form of
Go packages) to all controllers that wanted to leverage the apply-logic.
However, even if one managed to get client-side apply running from inside a
controller, you ended up with a solution that gave no control over how it
worked internally. As an example, there was no way to individually decide which
fields to overwrite in case of external changes and which ones to let go.
Discovering server-side apply
I was never happy with the solution described above and then somehow stumbled
across server-side apply,
which was still in beta at that time. Experimenting with it via
kubectl apply --server-side revealed immediately that the true power of
SSA cannot be easily leveraged by shelling out to kubectl.
The way SSA is implemented in kubectl does not allow enough
control over conflict resolution as it can only switch between
"not force-applying anything and erroring out" and "force-applying everything
without showing any mercy!".
The API documentation however made it clear that SSA is able to
control conflict resolution at the field level, simply by choosing which fields
to include and which fields to omit from the supplied object.
Moving away from kubectl
This meant that Kluctl had to move away from shelling out to kubectl first. Only
after that was done, I would have been able to properly implement SSA
with its powerful conflict resolution.
To achieve this, I first implemented access to the target clusters via a
Kubernetes client library. This had the nice side effect of dramatically
speeding up Kluctl as well. It also improved the security and usability of
Kluctl by ensuring that a running Kluctl command could not be messed around
with by externally modifying the kubeconfig while it was running.
Implementing SSA
After switching to a Kubernetes client library, leveraging SSA
felt easy. Kluctl now has to send each manifest to the API server as part of a
PATCH request, which signals
that Kluctl wants to perform an SSA operation. The API server then
responds with an OK response (HTTP status code 200), or with a Conflict response
(HTTP status 409).
In case of a Conflict response, the body of that response includes machine-readable
details about the conflicts. Kluctl can then use these details to figure out
which fields are in conflict and which actors (field managers) have taken
ownership of the conflicted fields.
Then, for each field, Kluctl will decide if the conflict should be ignored or
if it should be force-applied. If any field needs to be force-applied, Kluctl
will retry the apply operation with the ignored fields omitted and the force
flag being set on the API call.
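To illustrate, the machine-readable details of a Conflict response look roughly like the following Status object (a hand-written sketch, not verbatim API server output; the field and manager names are invented):
apiVersion: v1
kind: Status
status: Failure
reason: Conflict
code: 409
message: 'Apply failed with 1 conflict: conflict with "kubectl" using apps/v1: .spec.replicas'
details:
  causes:
  - reason: FieldManagerConflict
    field: .spec.replicas
    message: conflict with "kubectl"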
In case a conflict is ignored, Kluctl will issue a warning to the user so that
the user can react properly (or ignore it forever...).
That's basically it. That is all that is required to leverage SSA.
Big thanks and thumbs-up to the Kubernetes developers who made this possible!
Conflict Resolution
Kluctl has a few simple rules to figure out if a conflict should be ignored
or force-applied.
It first checks the field's actor (the field manager) against a list of known
field manager strings from tools that are frequently used to perform manual modifications. These
are for example kubectl and k9s. Any modifications performed with these tools
are considered "temporary" and will be overwritten by Kluctl.
If you're using Kluctl along with kubectl and you don't want the changes from
kubectl to be overwritten (for example, when using it in a script), then you can specify
--field-manager=<manager-name> on the command line to kubectl, and Kluctl
will not apply its special heuristic.
If the field manager is not known by Kluctl, it will check if force-applying is
requested for that field. Force-applying can be requested in different ways:
By passing --force-apply to Kluctl. This will cause ALL fields to be force-applied on conflicts.
By adding the kluctl.io/force-apply=true annotation to the object in question. This will cause all fields of that object to be force-applied on conflicts.
By adding the kluctl.io/force-apply-field=my.json.path annotation to the object in question. This causes only fields matching the JSON path to be force-applied on conflicts.
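For example, based on the annotations described above, an object could opt a single field into force-applying like this (the Deployment and the field path are purely illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
  annotations:
    kluctl.io/force-apply-field: "spec.replicas"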
Marking a field to be force-applied is required whenever some other actor is
known to erroneously claim fields (the ECK operator does this to the nodeSets
field, for example); this way, you can ensure that Kluctl always overwrites these
fields to the original or a new value.
In the future, Kluctl will allow even more control over conflict resolution.
For example, the CLI will allow controlling force-applying at the field level.
DevOps vs Controllers
So how does SSA in Kluctl lead to "live and let live"?
It allows the co-existence of classical pipelines (e.g. Github Actions or
Gitlab CI), controllers (e.g. the HPA controller or GitOps style controllers)
and even admins running deployments from their local machines.
Wherever you are on your infrastructure automation journey, Kluctl has a place
for you. From running deployments using a script on your PC, all the way to
fully automated CI/CD with the pipelines themselves defined in code, Kluctl
aims to complement the workflow that's right for you.
And even after fully automating everything, you can intervene with your admin
permissions if required and run a kubectl command that will modify a field
and prevent Kluctl from overwriting it. You'd just have to switch to a
field-manager (e.g. "admin-override") that is not overwritten by Kluctl.
A few takeaways
Server-side apply is a great feature and essential for the future of
controllers and tools in Kubernetes. The number of controllers involved
will only grow, and proper modes of working together are a must.
I believe that CI/CD-related controllers and tools should leverage
SSA to perform proper conflict resolution. I also believe that
other controllers (e.g. Flux and ArgoCD) would benefit from the same kind
of conflict resolution control on field-level.
It might even be a good idea to come together and work on a standardized
set of annotations to control conflict resolution for CI/CD-related tooling.
On the other side, non CI/CD-related controllers should ensure that they don't
cause unnecessary conflicts when modifying objects. As per
the server-side apply documentation,
it is strongly recommended for controllers to always perform force-applying. When
following this recommendation, controllers should really make sure that only
fields related to the controller are included in the applied object.
Otherwise, unnecessary conflicts are guaranteed.
In many cases, controllers are meant to only modify the status subresource
of the objects they manage. In this case, controllers should only patch the
status subresource and not touch the actual object. If this is followed,
conflicts cannot occur.
If you are a developer of such a controller and unsure about your controller
adhering to the above, simply try to retrieve an object managed by your
controller and look at the managedFields (you'll need to pass
--show-managed-fields -oyaml to kubectl get) to see if some field got
claimed unexpectedly.
Server-side apply (SSA) has now
been GA for a few releases, and I
have found myself in a number of conversations, recommending that people / teams
in various situations use it. So I’d like to write down some of those reasons.
Obvious (and not-so-obvious) benefits of SSA
A list of improvements / niceties you get from switching from various things to
Server-side apply!
Versus client-side-apply (that is, plain kubectl apply):
The system gives you conflicts when you accidentally fight with another
actor over the value of a field!
When combined with --dry-run, there’s no chance of accidentally running a
client-side dry run instead of a server side dry run.
Versus hand-rolling patches:
The SSA patch format is extremely natural to write, with no weird syntax.
It’s just a regular object, but you can (and should) omit any field you
don’t care about.
The old patch format (“strategic merge patch”) was ad-hoc and still has some
bugs; JSON-patch and JSON merge-patch fail to handle some cases that are
common in the Kubernetes API, namely lists with items that should be
recursively merged based on a “name” or other identifying field.
You can use SSA to explicitly delete fields you don’t “own” by setting them
to null, which makes it a feature-complete replacement for all of the old
patch formats.
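As a sketch of that last point, applying a manifest like the following server-side requests deletion of the replicas field (the Deployment is illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: null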
Versus shelling out to kubectl:
You can use the apply API call from any language without shelling out to
kubectl!
(This one is more complicated and you can skip it if you've never written a
controller!)
To use GET-modify-PUT correctly, you have to handle and retry a write
failure in the case that someone else has modified the object in any way
between your GET and PUT. This is an “optimistic concurrency failure” when
it happens.
SSA offloads this task to the server– you only have to retry if there’s a
conflict, and the conflicts you can get are all meaningful, like when you’re
actually trying to take a field away from another actor in the system.
To put it another way, if 10 actors do a GET-modify-PUT cycle at the same
time, 9 will get an optimistic concurrency failure and have to retry, then
8, etc, for up to 50 total GET-PUT attempts in the worst case (that’s .5N^2
GET and PUT calls for N actors making simultaneous changes). If the actors
are using SSA instead, and the changes don’t actually conflict over specific
fields, then all the changes can go in in any order. Additionally, SSA
changes can often be done without a GET call at all. That’s only N apply
requests for N actors, which is a drastic improvement!
How can I use SSA?
Users
Use kubectl apply --server-side! Soon we (SIG API Machinery) hope to make this
the default and remove the “client side” apply completely!
Controller authors
There are two main categories here, but for both of them, you should probably
force conflicts when using SSA. This is because your controller probably
doesn’t know what to do when some other entity in the system has a different
desire than your controller about a particular field. (See the CI/CD
section, though!)
Controllers that use either a GET-modify-PUT sequence or a PATCH
This kind of controller GETs an object (possibly from a
watch),
modifies it, and then PUTs it back to write its changes. Sometimes it constructs
a custom PATCH, but the semantics are the same. Most existing controllers
(especially those in-tree) work like this.
If your controller is perfect, great! You don’t need to change it. But if you do
want to change it, you can take advantage of the new client library’s extract
workflow– that is, get the existing object, extract your existing desires,
make modifications, and re-apply. For many controllers that were computing
the smallest API changes possible, this will be a minor update to the existing
implementation.
This workflow avoids the failure mode of accidentally trying to own every field
in the object, which is what happens if you just GET the object, make changes,
and then apply. (Note that the server will notice you did this and reject
your change!)
Reconstructive controllers
This kind of controller wasn't really possible prior to SSA. The idea here is to
(whenever something changes etc) reconstruct from scratch the fields of the
object as the controller wishes them to be, and then apply the change to the
server, letting it figure out the result. I now recommend that new controllers
start out this way–it's less fiddly to say what you want an object to look like
than it is to say how you want it to change.
The client library supports this method of operation by default.
The only downside is that you may end up sending unneeded apply requests to
the API server, even if actually the object already matches your controller’s
desires. This doesn't matter if it happens once in a while, but for extremely
high-throughput controllers, it might cause a performance problem for the
cluster–specifically, the API server. No-op writes are not written to storage
(etcd) or broadcast to any watchers, so it’s not really that big of a deal. If
you’re worried about this anyway, today you could use the method explained in
the previous section, or you could still do it this way for now, and wait for an
additional client-side mechanism to suppress zero-change applies.
To get around this downside, why not GET the object and only send your apply
if the object needs it? Surprisingly, it doesn't help much – a no-op apply is
not very much more work for the API server than an extra GET; and an apply
that changes things is cheaper than that same apply with a preceding GET.
Worse, since it is a distributed system, something could change between your GET
and apply, invalidating your computation. Instead, you can use this
optimization on an object retrieved from a cache–then it legitimately will
reduce load on the system (at the cost of a delay when a change is needed and
the cache is a bit behind).
CI/CD systems
Continuous integration (CI) and/or continuous deployment (CD) systems are a
special kind of controller which is doing something like reading manifests from
source control (such as a Git repo) and automatically pushing them into the
cluster. Perhaps the CI / CD process first generates manifests from a template,
then runs some tests, and then deploys a change. Typically, users are the
entities pushing changes into source control, although that’s not necessarily
always the case.
Some systems like this continuously reconcile with the cluster, others may only
operate when a change is pushed to the source control system. The following
considerations are important for both, but more so for the continuously
reconciling kind.
CI/CD systems are literally controllers, but for the purpose of apply, they
are more like users, and unlike other controllers, they need to pay attention to
conflicts. Reasoning:
Abstractly, CI/CD systems can change anything, which means they could conflict
with any controller out there. The recommendation that controllers force
conflicts is assuming that controllers change a limited number of things and
you can be reasonably sure that they won’t fight with other controllers about
those things; that’s clearly not the case for CI/CD controllers.
Concrete example: imagine the CI/CD system wants .spec.replicas for some
Deployment to be 3, because that is the value that is checked into source
code; however there is also a HorizontalPodAutoscaler (HPA) that targets the
same deployment. The HPA computes a target scale and decides that there should
be 10 replicas. Which should win? I just said that most controllers–including
the HPA–should ignore conflicts. The HPA has no idea if it has been enabled
incorrectly, and the HPA has no convenient way of informing users of errors.
The other common cause of a CI/CD system getting a conflict is probably when
it is trying to overwrite a hot-fix (hand-rolled patch) placed there by a
system admin / SRE / dev-on-call. You almost certainly don’t want to override
that automatically.
Of course, sometimes SRE makes an accidental change, or a dev makes an
unauthorized change – those you do want to notice and overwrite; however, the
CI/CD system can’t tell the difference between these last two cases.
Hopefully this convinces you that CI/CD systems need error paths–a way to
back-propagate these conflict errors to humans; in fact, they should have this
already, certainly continuous integration systems need some way to report that
tests are failing. But maybe I can also say something about how humans can
deal with errors:
Reject the hotfix: the (human) administrator of the CI/CD system observes the
error, and manually force-applies the manifest in question. Then the CI/CD
system will be able to apply the manifest successfully and become a co-owner.
Optional: then the administrator applies a blank manifest (just the object
type / namespace / name) to relinquish any fields they became a manager for.
If this step is omitted, there's some chance the administrator will end up
owning fields and causing an unwanted future conflict.
Note: why an administrator? I'm assuming that developers which ordinarily
push to the CI/CD system and / or its source control system may not have
permissions to push directly to the cluster.
Accept the hotfix: the author of the change in question sees the conflict, and
edits their change to accept the value running in production.
Accept then reject: as in the accept option, but after that manifest is
applied, and the CI/CD queue owns everything again (so no conflicts), re-apply
the original manifest.
I can also imagine the CI/CD system permitting you to mark a manifest as
“force conflicts” somehow– if there’s demand for this we could consider making
a more standardized way to do this. A rigorous version of this which lets you
declare exactly which conflicts you intend to force would require support from
the API server; in lieu of that, you can make a second manifest with only that
subset of fields.
Future work: we could imagine an especially advanced CI/CD system that could
parse metadata.managedFields data to see who or what they are conflicting
with, over what fields, and decide whether or not to ignore the conflict. In
fact, this information is also presented in any conflict errors, though
perhaps not in an easily machine-parseable format. We (SIG API Machinery)
mostly didn't expect that people would want to take this approach — so we
would love to know if in fact people want/need the features implied by this
approach, such as the ability, when applying, to request overriding
certain conflicts but not others.
If this sounds like an approach you'd want to take for your own controller,
come talk to SIG API Machinery!
Craig Ingram has graciously attempted over the years to keep track of the
status of the findings reported in the last audit in this issue:
kubernetes/kubernetes#81146.
This blog post will attempt to dive deeper into this, address any gaps
in tracking, and become a point-in-time summary of the state of the
findings reported in 2019.
This article should also help readers gain confidence, through transparent
communication, in the work done by the community to address these findings, and
bubble up any findings that need help from community contributors.
Current State
The status of each issue / finding here is represented in a best effort manner.
The authors do not claim to be 100% accurate on the status, and welcome any
corrections or feedback, via comments directly on the relevant issue, if the
current state is not reflected accurately.
Apart from fixes to the specific issues, the 2019 third party security audit
also motivated security-focused enhancements in the next few releases of
Kubernetes. One such example is
Kubernetes Enhancement Proposal (KEP) 1933, Defend Against Logging Secrets via Static Analysis, to prevent exposing
secrets to logs, with Patrick Rhomberg driving the
implementation. As a result of this KEP,
go-flow-levee, a taint propagation
analysis tool configured to detect logging of secrets, is executed in a
script
as a Prow presubmit job. This KEP was introduced in v1.20.0 as an alpha
feature, then graduated to beta in v1.21.0, and graduated to stable in
v1.23.0. As stable, the analysis runs as a blocking presubmit test. This
KEP also helped resolve several issues from the 2019 third party security audit.
Many of the 37 findings identified were fixed by work from
our community members over the last 3 years. However, we still have some work
left to do. Here's a breakdown of remaining work with rough estimates on
time commitment, complexity and benefits to the ecosystem on fixing
these pending issues.
Note: Anything requiring a KEP (Kubernetes Enhancement Proposal) is considered
high time commitment and high complexity. Benefits to Ecosystem are
roughly equivalent to the risk of keeping the finding unfixed, which is
determined by Severity Level + Likelihood of a successful vulnerability
exploit. These estimates and values in the table below are the authors'
personal opinion. An individual or end users' threat model may rate the
benefits to fix a particular issue higher or lower.
Title | Issue | Time Commitment | Complexity | Benefit to Ecosystem
Kubernetes does not facilitate certificate revocation
To get started on fixing any of these findings that need help, please
consider getting involved in Kubernetes SIG
Security
by joining our bi-weekly meetings or hanging out with us on our Slack
Channel.
Authors: Abdullah Gharaibeh (Google), Aldo Culquicondor (Google)
Whether on-premises or in the cloud, clusters face real constraints for resource usage, quota, and cost management reasons. Regardless of autoscaling capabilities, clusters have finite capacity. As a result, users want an easy way to fairly and
efficiently share resources.
In this article, we introduce Kueue,
an open source job queueing controller designed to manage batch jobs as a single unit.
Kueue leaves pod-level orchestration to existing stable components of Kubernetes.
Kueue natively supports the Kubernetes Job
API and offers hooks for integrating other custom-built APIs for batch jobs.
Why Kueue?
Job queueing is a key feature to run batch workloads at scale in both on-premises and cloud environments. The main goal
of job queueing is to manage access to a limited pool of resources shared by multiple tenants. Job queueing decides which
jobs should wait, which can start immediately, and what resources they can use.
Some of the most desired job queueing requirements include:
Quota and budgeting to control who can use what and up to what limit. This is not only needed in clusters with static resources like on-premises,
but it is also needed in cloud environments to control spend or usage of scarce resources.
Fair sharing of resources between tenants. To maximize the usage of available resources, any unused quota assigned to inactive tenants should be
allowed to be shared fairly between active tenants.
Flexible placement of jobs across different resource types based on availability. This is important in cloud environments which have heterogeneous
resources such as different architectures (GPU or CPU models) and different provisioning modes (spot vs on-demand).
Support for autoscaled environments where resources can be provisioned on demand.
Plain Kubernetes doesn't address the above requirements. In normal circumstances, once a Job is created, the job-controller instantly creates the
pods and kube-scheduler continuously attempts to assign the pods to nodes. At scale, this situation can work the control plane to death. There is
also currently no good way to control at the job level which jobs should get which resources first, and no way to express order or fair sharing. The
current ResourceQuota model is not a good fit for these needs because quotas are enforced on resource creation, and there is no queueing of requests. The
intent of ResourceQuotas is to provide a built-in reliability mechanism with policies needed by admins to protect clusters from failing over.
In the Kubernetes ecosystem, there are several solutions for job scheduling. However, we found that these alternatives have one or more of the following problems:
They replace existing stable components of Kubernetes, like kube-scheduler or the job-controller. This is problematic not only from an operational point of view, but
also the duplication in the job APIs causes fragmentation of the ecosystem and reduces portability.
They don't integrate with autoscaling, or
They lack support for resource flexibility.
How Kueue works
With Kueue we decided to take a different approach to job queueing on Kubernetes that is anchored around the following aspects:
Not duplicating existing functionalities already offered by established Kubernetes components for pod scheduling, autoscaling and job
lifecycle management.
Adding key features that are missing to existing components. For example, we invested in the Job API to cover more use cases like
IndexedJob and fixed long standing issues related to pod
tracking. While this path takes longer to
land features, we believe it is the more sustainable long term solution.
Ensuring compatibility with cloud environments where compute resources are elastic and heterogeneous.
For this approach to be feasible, Kueue needs knobs to influence the behavior of those established components so it can effectively manage
when and where to start a job. We added those knobs to the Job API in the form of two features:
Suspend field, which allows Kueue to signal to the job-controller
when to start or stop a Job.
Mutable scheduling directives, which allows Kueue to
update a Job's .spec.template.spec.nodeSelector before starting the Job. This way, Kueue can control Pod placement while still
delegating to kube-scheduler the actual pod-to-node scheduling.
Note that any custom job API can be managed by Kueue if that API offers the above two capabilities.
Resource model
Kueue defines new APIs to address the requirements mentioned at the beginning of this post. The three main APIs are:
ResourceFlavor: a cluster-scoped API to define a resource flavor available for consumption, like a GPU model. At its core, a ResourceFlavor is
a set of labels that mirrors the labels on the nodes that offer those resources.
ClusterQueue: a cluster-scoped API to define resource pools by setting quotas for one or more ResourceFlavor.
LocalQueue: a namespaced API for grouping and managing single tenant jobs. In its simplest form, a LocalQueue is a pointer to the ClusterQueue
that the tenant (modeled as a namespace) can use to start their jobs.
For more details, take a look at the API concepts documentation. While the three APIs may look overwhelming,
most of Kueue’s operations are centered around ClusterQueue; the ResourceFlavor and LocalQueue APIs are mainly organizational wrappers.
Example use case
Imagine the following setup for running batch workloads on a Kubernetes cluster on the cloud:
You have cluster-autoscaler installed in the cluster to automatically
adjust the size of your cluster.
There are two types of autoscaled node groups that differ on their provisioning policies: spot and on-demand. The nodes of each group are
differentiated by the label instance-type=spot or instance-type=ondemand.
Moreover, since not all Jobs can tolerate running on spot nodes, the nodes are tainted with spot=true:NoSchedule.
To strike a balance between cost and resource availability, imagine you want Jobs to use up to 1000 cores of on-demand nodes, then use up to
2000 cores of spot nodes.
As an admin for the batch system, you define two ResourceFlavors that represent the two types of nodes:
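A sketch of what that could look like (this assumes the Kueue API of that era, kueue.x-k8s.io/v1alpha2, and invented object names; the schema has evolved since, so check the current Kueue documentation):
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ResourceFlavor
metadata:
  name: ondemand
labels:
  instance-type: ondemand
---
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ResourceFlavor
metadata:
  name: spot
labels:
  instance-type: spot
taints:
- key: spot
  value: "true"
  effect: NoSchedule
You then define a ClusterQueue that sets the quotas for both flavors, on-demand first (again a sketch under the same assumptions):
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}
  resources:
  - name: "cpu"
    flavors:
    - name: ondemand
      quota:
        min: 1000
    - name: spot
      quota:
        min: 2000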
Note that the order of flavors in the ClusterQueue resources matters: Kueue will attempt to fit jobs in the available quotas according to
the order unless the job has an explicit affinity to specific flavors.
For each namespace, you define a LocalQueue that points to the ClusterQueue above:
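A sketch, again using the v1alpha2 schema and invented names:
apiVersion: kueue.x-k8s.io/v1alpha2
kind: LocalQueue
metadata:
  namespace: my-namespace
  name: main-queue
spec:
  clusterQueue: cluster-queue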
Admins create the above setup once. Batch users are able to find the queues they are allowed to
submit to by listing the LocalQueues in their namespace(s). The command is similar to the following: kubectl get -n my-namespace localqueues
To submit work, create a Job and set the kueue.x-k8s.io/queue-name annotation as follows:
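For example (a sketch; the image, names, and resource requests are illustrative, and the toleration matches the spot taint defined earlier):
apiVersion: batch/v1
kind: Job
metadata:
  namespace: my-namespace
  generateName: sample-job-
  annotations:
    kueue.x-k8s.io/queue-name: main-queue
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      restartPolicy: Never
      tolerations:
      - key: spot
        value: "true"
        effect: NoSchedule
      containers:
      - name: main
        image: busybox
        command: ["sleep", "30"]
        resources:
          requests:
            cpu: "1"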
Kueue intervenes to suspend the Job as soon as it is created. Once the Job is at the head of the ClusterQueue, Kueue evaluates if it can start
by checking if the resources requested by the job fit the available quota.
In the above example, the Job tolerates spot resources. If there are previously admitted Jobs consuming all existing on-demand quota but
not all of spot’s, Kueue admits the Job using the spot quota. Kueue does this by issuing a single update to the Job object that:
Changes the .spec.suspend flag to false
Adds the term instance-type: spot to the job's .spec.template.spec.nodeSelector so that when the pods are created by the job controller, those pods can only schedule
onto spot nodes.
Finally, if there are available empty nodes with matching node selector terms, then kube-scheduler will directly schedule the pods. If not, then
kube-scheduler will initially mark the pods as unschedulable, which will trigger the cluster-autoscaler to provision new nodes.
Future work and getting involved
The example above offers a glimpse of some of Kueue's features including support for quota, resource flexibility, and integration with cluster
autoscaler. Kueue also supports fair-sharing, job priorities, and different queueing strategies. Take a look at the
Kueue documentation to learn more about those features and how to use Kueue.
We have a number of features that we plan to add to Kueue, such as hierarchical quota, budgets, and support for dynamically sized jobs. In
the more immediate future, we are focused on adding support for job preemption.
The latest Kueue release is available on GitHub;
try it out if you run batch workloads on Kubernetes (requires v1.22 or newer).
We are in the early stages of this project and we are seeking feedback of all levels, major or minor, so please don’t hesitate to reach out. We’re
also open to additional contributors, whether it is to fix or report bugs, or help add new features or write documentation. You can get in touch with
us via our repo, mailing list or on
Slack.
Last but not least, thanks to all our contributors who made this project possible!
Authors: Rodrigo Campos (Microsoft), Giuseppe Scrivano (Red Hat)
Kubernetes v1.25 introduces support for user namespaces.
This is a major improvement for running secure workloads in
Kubernetes. Each pod will have access only to a limited subset of the
available UIDs and GIDs on the system, thus adding a new security
layer to protect from other pods running on the same system.
How does it work?
A process running on Linux can use up to 4294967296 (2^32) different UIDs and
GIDs.
User namespaces are a Linux feature that allows mapping a set of users
in the container to different users in the host, thus restricting which
IDs a process can effectively use.
Furthermore, the capabilities granted in a new user namespace do not
apply in the host initial namespaces.
Why is it important?
There are mainly two reasons why user namespaces are important:
improve security since they restrict the IDs a pod can use, so each
pod can run in its own separate environment with unique IDs.
enable running workloads as root in a safer manner.
In a user namespace, we can map the root user inside the pod to a
non-zero ID outside the container; the container believes it is running as
root, while from the host point of view it is a regular unprivileged ID.
The process can keep capabilities that are usually restricted to
privileged pods and do it in a safe way since the capabilities granted
in a new user namespace do not apply in the host initial namespaces.
How do I enable user namespaces?
At the moment, user namespaces support is opt-in, so you must enable
it for a pod by setting hostUsers to false in the pod spec stanza:
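For example (a minimal sketch; the pod name and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: userns-pod
spec:
  hostUsers: false
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian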
Immutable fields can be found in a few places in the built-in Kubernetes types.
For example, you can't change the .metadata.name of an object. Specific objects
have fields where changes to existing objects are constrained; for example, the
.spec.selector of a Deployment.
Aside from simple immutability, there are other common design patterns such as
lists which are append-only, or a map with mutable values and immutable keys.
Until recently the best way to restrict field mutability for CustomResourceDefinitions
has been to create a validating
admission webhook:
this means a lot of complexity for the common case of making a field immutable.
Beta since Kubernetes 1.25, CEL Validation Rules allow CRD authors to express
validation constraints on their fields using a rich expression language,
CEL. This article explores how you can
use validation rules to implement a few common immutability patterns directly in
the manifest for a CRD.
Basics of validation rules
The new support for CEL validation rules in Kubernetes allows CRD authors to add
complicated admission logic for their resources without writing any code!
For example, a CEL rule to constrain a field maximumSize to be greater than a
minimumSize for a CRD might look like the following:
rule: |
self.maximumSize > self.minimumSize
message: 'Maximum size must be greater than minimum size.'
The rule field contains an expression written in CEL. self is a special keyword
in CEL which refers to the object whose type contains the rule.
The message field is an error message which will be sent to Kubernetes clients
whenever this particular rule is not satisfied.
For more details about the capabilities and limitations of Validation Rules using
CEL, please refer to
validation rules.
The CEL specification is also a good
reference for information specifically related to the language.
Immutability patterns with CEL validation rules
This section implements several common use cases for immutability in Kubernetes
CustomResourceDefinitions, using validation rules expressed as
kubebuilder marker comments.
The resultant OpenAPI generated by the kubebuilder marker comments will also be
included, so that you can still follow along if you are writing your CRD
manifests by hand.
Project setup
To use CEL rules with kubebuilder comments, you first need to set up a Golang
project structure with the CRD defined in Go.
You may skip this step if you are not using kubebuilder or are only interested
in the resultant OpenAPI extensions.
Begin with a folder structure of a Go module set up like the following. If
you have your own project already set up feel free to adapt this tutorial to your liking:
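A sketch of the layout that the rest of this section assumes (adapt the paths to your own module):
cel-immutability-tutorial
├── generate.go
├── tools.go
└── pkg
    └── apis
        └── stable.example.com
            └── v1
                └── types.go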
types.go contains all type definitions in stable.example.com/v1
package v1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// An empty CRD as an example of defining a type using controller tools
// +kubebuilder:storageversion
// +kubebuilder:subresource:status
type TestCRD struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec TestCRDSpec `json:"spec,omitempty"`
Status TestCRDStatus `json:"status,omitempty"`
}
type TestCRDStatus struct {}
type TestCRDSpec struct {
// You will fill this in as you go along
}
tools.go contains a dependency on controller-gen which will be used to generate the CRD definition:
//go:build tools
package celimmutabilitytutorial
// Force direct dependency on code-generator so that it may be executed with go run
import (
_ "sigs.k8s.io/controller-tools/cmd/controller-gen"
)
Finally, generate.go contains a go:generate directive to make use of
controller-gen. controller-gen parses our types.go and generates
CRD yaml files into a crds folder:
package celimmutabilitytutorial
//go:generate go run sigs.k8s.io/controller-tools/cmd/controller-gen crd paths=./pkg/apis/... output:dir=./crds
You may now want to add dependencies for our definitions and test the code generation:
cd cel-immutability-tutorial
go mod init <your-org>/<your-module-name>
go mod tidy
go generate ./...
After running these commands you have completed the basic project structure,
and controller-gen has generated the CRD manifests into the crds folder.
The manifest for the example CRD is now available in crds/stable.example.com_testcrds.yaml.
Immutability after first modification
A common immutability design pattern is to make the field immutable once it has
been first set. This example will throw a validation error if the field
changes after being first initialized.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type ImmutableSinceFirstWrite struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
// +kubebuilder:validation:MaxLength=512
Value string `json:"value"`
}
The +kubebuilder directives in the comments inform controller-gen how to
annotate the generated OpenAPI. The XValidation marker causes the rule to appear
in the x-kubernetes-validations OpenAPI extension. Kubernetes then
enforces the constraints declared in the OpenAPI spec.
To enforce a field's immutability after its first write, you need to apply the following constraints:
The field must be allowed to be initially unset: +kubebuilder:validation:Optional
Once set, the field must not be allowed to be removed: !has(oldSelf.value) || has(self.value) (type-scoped rule)
Once set, the field must not be allowed to change value: self == oldSelf (field-scoped rule)
Also note the additional directive +kubebuilder:validation:MaxLength. CEL
requires that all strings have a declared maximum length so that it can estimate the
computation cost of the rule. Rules that are too expensive will be rejected.
For more information on CEL cost budgeting, check out the other tutorial.
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincefirstwrites.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincefirstwrites.stable.example.com created
Creating an initial empty object with no value is permitted, since value is optional. Once value has been set, however, an update that removes it is rejected by the type-scoped rule:
The ImmutableSinceFirstWrite "test1" is invalid: <nil>: Invalid value: "object": Value is required once set
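A sketch of a sequence that exercises these rules (the resource name is illustrative):
# 1. Create the object with no value: allowed, because value is optional
kubectl apply -f - <<EOF
apiVersion: stable.example.com/v1
kind: ImmutableSinceFirstWrite
metadata:
  name: test1
EOF

# 2. Set value for the first time: allowed
kubectl apply -f - <<EOF
apiVersion: stable.example.com/v1
kind: ImmutableSinceFirstWrite
metadata:
  name: test1
value: first-write
EOF

# 3. Applying a config without value (removing it) now fails with
#    "Value is required once set"; changing it fails with "Value is immutable".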
Generated schema
Note that in the generated schema there are two separate rule locations.
One is attached directly to the value property.
The other rule is associated with the CRD type itself.
openAPIV3Schema:
properties:
value:
maxLength: 512
type: string
x-kubernetes-validations:
- message: Value is immutable
rule: self == oldSelf
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.value) || has(self.value)'
Immutability upon object creation
A field which is immutable upon creation is implemented similarly to the
earlier example. The difference is that the field is marked required, and the
type-scoped rule is no longer necessary.
type ImmutableSinceCreation struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Required
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
// +kubebuilder:validation:MaxLength=512
Value string `json:"value"`
}
This field will be required when the object is created, and after that point it will
not be allowed to be modified. The CEL validation rule self == oldSelf enforces exactly this.
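A custom resource for this type might look like the following sketch (the name and value are illustrative):
apiVersion: stable.example.com/v1
kind: ImmutableSinceCreation
metadata:
  name: test1
value: fixed-at-creation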
Usage example
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincecreations.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincecreations.stable.example.com created
Applying an object without the required field should fail:
The ImmutableSinceCreation "test1" is invalid:
* value: Required value
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation
Now that the field has been added, the operation is permitted:
immutablesincecreation.stable.example.com/test1 created
If you attempt to change the value, the operation is blocked due to the
validation rules in the CRD. Note that the error message is as it was defined
in the validation rule.
The ImmutableSinceCreation "test1" is invalid: value: Invalid value: "string": Value is immutable
Generated schema
openAPIV3Schema:
properties:
value:
maxLength: 512
type: string
x-kubernetes-validations:
- message: Value is immutable
rule: self == oldSelf
required:
- value
type: object
Append-only list of containers
In the case of ephemeral containers on Pods, Kubernetes enforces that the
elements in the list are immutable, and can’t be removed. The following example
shows how you could use CEL to achieve the same behavior.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type AppendOnlyList struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:MaxItems=100
// +kubebuilder:validation:XValidation:rule="oldSelf.all(x, x in self)",message="Values may only be added"
Values []v1.EphemeralContainer `json:"value"`
}
Once set, field must not be deleted: !has(oldSelf.value) || has(self.value) (type-scoped)
Once a value is added it is not removed: oldSelf.all(x, x in self) (field-scoped)
Value may be initially unset: +kubebuilder:validation:Optional
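(In the Go snippet above, v1 refers to k8s.io/api/core/v1, which must be added to the import block in types.go.) A conforming object might start out like this sketch (the container details are illustrative); later updates may append entries to value, but existing entries can be neither removed nor modified:
apiVersion: stable.example.com/v1
kind: AppendOnlyList
metadata:
  name: testlist
value:
- name: debugger
  image: busybox:1.36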
Note that for cost-budgeting purposes, MaxItems is also required to be specified.
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_appendonlylists.yaml
customresourcedefinition.apiextensions.k8s.io/appendonlylists.stable.example.com created
Creating an initial list with one element inside should succeed without problem. Removing the entire value field afterwards, however, is blocked by the type-scoped rule:
The AppendOnlyList "testlist" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
openAPIV3Schema:
properties:
value:
items: ...
maxItems: 100
type: array
x-kubernetes-validations:
- message: Values may only be added
rule: oldSelf.all(x, x in self)
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.value) || has(self.value)'
Map with append-only keys, immutable values
// A map which does not allow keys to be removed or their values changed once set. New keys may be added, however.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.values) || has(self.values)", message="Value is required once set"
type MapAppendOnlyKeys struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:MaxProperties=10
// +kubebuilder:validation:XValidation:rule="oldSelf.all(key, key in self && self[key] == oldSelf[key])",message="Keys may not be removed and their values must stay the same"
Values map[string]string `json:"values,omitempty"`
}
Once set, field must not be deleted: !has(oldSelf.values) || has(self.values) (type-scoped)
Once a key is added it is not removed nor is its value modified: oldSelf.all(key, key in self && self[key] == oldSelf[key]) (field-scoped)
Value may be initially unset: +kubebuilder:validation:Optional
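A conforming object might start out like this sketch (the key and value are illustrative); later updates may add new keys, but key1 can neither be removed nor have its value changed:
apiVersion: stable.example.com/v1
kind: MapAppendOnlyKeys
metadata:
  name: testmap
values:
  key1: value1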
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_mapappendonlykeys.yaml
customresourcedefinition.apiextensions.k8s.io/mapappendonlykeys.stable.example.com created
Creating an initial object with one key within values is permitted. Removing that key, or changing its value, is then blocked by the field-scoped rule:
The MapAppendOnlyKeys "testmap" is invalid: values: Invalid value: "object": Keys may not be removed and their values must stay the same
If the entire field is removed, the other validation rule is triggered and the
operation is prevented. Note that the error message for the validation rule is
shown to the user.
The MapAppendOnlyKeys "testmap" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
openAPIV3Schema:
description: A map which does not allow keys to be removed or their values
changed once set. New keys may be added, however.
properties:
values:
additionalProperties:
type: string
maxProperties: 10
type: object
x-kubernetes-validations:
- message: Keys may not be removed and their values must stay the same
rule: oldSelf.all(key, key in self && self[key] == oldSelf[key])
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.values) || has(self.values)'
Going further
The above examples showed how CEL rules can be added to kubebuilder types.
The same rules can be added directly to OpenAPI if writing a manifest for a CRD by hand.
For native types, the same behavior can be achieved using kube-openapi’s marker
+validations.
Usage of CEL within Kubernetes validation rules is far more powerful than
what has been shown in this article. For more information please check out
validation rules
in the Kubernetes documentation and CRD Validation Rules Beta blog post.
The Kubernetes in-tree storage plugin to Container Storage Interface (CSI) migration infrastructure was introduced as alpha in Kubernetes v1.14 and has been beta since v1.17.
Since then, SIG Storage and other Kubernetes special interest groups have been working to ensure feature stability and compatibility in preparation for the CSI Migration feature to go GA.
SIG Storage is excited to announce that the core CSI Migration feature is generally available in the Kubernetes v1.25 release!
SIG Storage wrote a blog post for v1.23 that discussed the CSI migration status of each storage driver. It has been a while, and this article is intended to give the latest status update on each storage driver's CSI Migration status in Kubernetes v1.25.
Quick recap: What is CSI Migration, and why migrate?
The Container Storage Interface (CSI) was designed to help Kubernetes replace its existing, in-tree storage driver mechanisms, especially vendor-specific plugins.
Kubernetes support for the Container Storage Interface has been
generally available since Kubernetes v1.13.
Support for using CSI drivers was introduced to make it easier to add and maintain new integrations between Kubernetes and storage backend technologies. Using CSI drivers allows for better maintainability (driver authors can define their own release cycle and support lifecycle) and reduces the opportunity for vulnerabilities (with less in-tree code, the risks of a mistake are reduced, and cluster operators can select only the storage drivers that their cluster requires).
As more CSI drivers were created and became production ready, SIG Storage wanted all Kubernetes users to benefit from the CSI model. However, we could not break API compatibility with the existing storage API types due to Kubernetes architecture conventions. The solution we came up with was CSI migration: a feature that translates in-tree APIs to equivalent CSI APIs and delegates operations to a replacement CSI driver.
The CSI migration effort enables the replacement of existing in-tree storage plugins such as kubernetes.io/gce-pd or kubernetes.io/aws-ebs with a corresponding CSI driver from the storage backend.
If CSI Migration is working properly, Kubernetes end users shouldn’t notice a difference. Existing StorageClass, PersistentVolume and PersistentVolumeClaim objects should continue to work.
When a Kubernetes cluster administrator updates a cluster to enable CSI migration, existing workloads that utilize PVCs which are backed by in-tree storage plugins will continue to function as they always have.
However, behind the scenes, Kubernetes hands control of all storage management operations (previously targeting in-tree drivers) to CSI drivers.
For example, suppose you are a kubernetes.io/gce-pd user; after CSI migration, you can still use kubernetes.io/gce-pd to provision new volumes, mount existing GCE-PD volumes or delete existing volumes. All existing APIs and interfaces will still function correctly. However, the underlying function calls all go through the GCE PD CSI driver instead of the in-tree Kubernetes functions.
This enables a smooth transition for end users. Additionally, as storage plugin developers, we can reduce the burden of maintaining the in-tree storage plugins and eventually remove them from the core Kubernetes binary.
What is the timeline / status?
The current and targeted releases for each individual driver are shown in the table below:

| Driver | Alpha | Beta (in-tree deprecated) | Beta (on-by-default) | GA | Target "in-tree plugin" removal |
| --- | --- | --- | --- | --- | --- |
| AWS EBS | 1.14 | 1.17 | 1.23 | 1.25 | 1.27 (Target) |
| Azure Disk | 1.15 | 1.19 | 1.23 | 1.24 | 1.26 (Target) |
| Azure File | 1.15 | 1.21 | 1.24 | 1.26 (Target) | 1.28 (Target) |
| Ceph FS | 1.26 (Target) | | | | |
| Ceph RBD | 1.23 | 1.26 (Target) | 1.27 (Target) | 1.28 (Target) | 1.30 (Target) |
| GCE PD | 1.14 | 1.17 | 1.23 | 1.25 | 1.27 (Target) |
| OpenStack Cinder | 1.14 | 1.18 | 1.21 | 1.24 | 1.26 (Target) |
| Portworx | 1.23 | 1.25 | 1.26 (Target) | 1.27 (Target) | 1.29 (Target) |
| vSphere | 1.18 | 1.19 | 1.25 | 1.26 (Target) | 1.28 (Target) |
The following storage drivers will not have CSI migration support.
The scaleio, flocker, quobyte and storageos drivers were removed; the others are deprecated and will be removed from core Kubernetes in the coming releases.
| Driver | Deprecated | Code Removal |
| --- | --- | --- |
| Flocker | 1.22 | 1.25 |
| GlusterFS | 1.25 | 1.26 (Target) |
| Quobyte | 1.22 | 1.25 |
| ScaleIO | 1.16 | 1.22 |
| StorageOS | 1.22 | 1.25 |
What does it mean for the core CSI Migration feature to go GA?
The core CSI Migration feature going GA means that the general framework, core library, and API for CSI migration are
stable in Kubernetes v1.25 and will be part of future Kubernetes releases as well.
If you are a Kubernetes distribution maintainer, this means that if you previously disabled the CSIMigration feature gate, you can no longer do so: the feature gate is locked to enabled.
If you are a Kubernetes storage driver developer, this means you can expect no backwards-incompatible changes in the CSI migration library.
If you are a Kubernetes maintainer, expect nothing to change in your day-to-day development flow.
If you are a Kubernetes user, expect nothing to change from your day-to-day usage flows. If you encounter any storage related issues, contact the people who operate your cluster (if that's you, contact the provider of your Kubernetes distribution, or get help from the community).
What does it mean for the storage driver CSI migration to go GA?
A storage driver's CSI Migration going GA means that the specific storage driver supports CSI Migration, and you can expect feature parity between the in-tree plugin and the CSI driver.
If you are a Kubernetes distribution maintainer, make sure you install the corresponding
CSI driver on the distribution, and make sure you do not disable the driver-specific CSIMigration{provider} feature gates, as they are locked.
If you are a Kubernetes storage driver maintainer, make sure the CSI driver provides feature parity with the in-tree plugin if it supports CSI migration.
If you are a Kubernetes maintainer/developer, expect nothing to change from your day-to-day development flows.
If you are a Kubernetes user, the CSI Migration feature should be completely transparent
to you, the only requirement is to install the corresponding CSI driver.
What's next?
We expect the removal of cloud provider in-tree storage plugin code to start as part of the v1.26 and v1.27 releases of Kubernetes. More and more drivers that support CSI migration will go GA in the upcoming releases.
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. We offer a huge thank you to the contributors who stepped up these last quarters to help move the project forward:
Xing Yang (xing-yang)
Hemant Kumar (gnufied)
Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CSI migration feature:
Andy Zhang (andyzhangz)
Divyen Patel (divyenpatel)
Deep Debroy (ddebroy)
Humble Devassy Chirammal (humblec)
Ismail Alidzhikov (ialidzhikov)
Jordan Liggitt (liggitt)
Matthew Cary (mattcary)
Matthew Wong (wongma7)
Neha Arora (nearora-msft)
Oksana Naumov (trierra)
Saad Ali (saad-ali)
Michelle Au (msau42)
Those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.
Validation rules make it possible to declare how custom resources are validated using the Common Expression Language (CEL). For example:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
...
openAPIV3Schema:
type: object
properties:
spec:
type: object
x-kubernetes-validations:
- rule: "self.minReplicas <= self.replicas && self.replicas <= self.maxReplicas"
message: "replicas should be in the range minReplicas..maxReplicas."
properties:
replicas:
type: integer
...
Validation rules support a wide range of use cases. To get a sense of some of the capabilities, let's look at a few examples:
| Validation Rule | Purpose |
| --- | --- |
| self.minReplicas <= self.replicas | Validate an integer field is less than or equal to another integer field |
| 'Available' in self.stateCounts | Validate an entry with the 'Available' key exists in a map |
| self.set1.all(e, !(e in self.set2)) | Validate that the elements of two sets are disjoint |
| self == oldSelf | Validate that a required field is immutable once it is set |
| self.created + self.ttl < self.expired | Validate that 'expired' date is after a 'create' date plus a 'ttl' duration |
Validation rules are expressive and flexible. See the Validation Rules documentation to learn more about what validation rules are capable of.
Why CEL?
CEL was chosen as the language for validation rules for a couple reasons:
CEL expressions can easily be inlined into CRD schemas. They are sufficiently expressive to replace the vast majority of CRD validation checks currently implemented in admission webhooks. This results in CRDs that are self-contained and are easier to understand.
CEL expressions are compiled and type checked against a CRD's schema "ahead-of-time" (when CRDs are created and updated), allowing them to be evaluated efficiently and safely at "runtime" (when custom resources are validated). Even regex string literals in CEL are validated and pre-compiled when CRDs are created or updated.
Why not use validation webhooks?
Benefits of using validation rules when compared with validation webhooks:
CRD authors benefit from a simpler workflow since validation rules eliminate the need to develop and maintain a webhook.
Cluster administrators benefit by no longer having to install, upgrade and operate webhooks for the purposes of CRD validation.
Cluster operability improves because CRD validation no longer requires a remote call to a webhook endpoint, eliminating a potential point of failure in the request-serving-path of the Kubernetes API server. This allows clusters to retain high availability while scaling to larger numbers of installed CRD extensions, since expected control plane availability would otherwise decrease with each additional webhook installed.
Getting started with validation rules
Writing validation rules in OpenAPIv3 schemas
You can define validation rules for any level of a CRD's OpenAPIv3 schema. Validation rules are automatically scoped to their location in the schema where they are declared.
Good practices for CRD validation rules:
Scope validation rules as close as possible to the field(s) they validate.
Use multiple rules when validating independent constraints.
Do not use validation rules for checks that the OpenAPIv3 schema can already express:
Use OpenAPIv3 value validations (maxLength, maxItems, maxProperties, required, enum, minimum, maximum, ...) and string formats where available.
Use x-kubernetes-int-or-string, x-kubernetes-embedded-type and x-kubernetes-list-type=(set|map) where appropriate.
Examples of good practice:
| Validation | Best Practice | Example(s) |
| --- | --- | --- |
| Validate an integer is between 0 and 100. | Use OpenAPIv3 value validations. | type: integer, minimum: 0, maximum: 100 |
| Constrain the max size limits on maps (objects with additionalProperties), arrays and strings. | Use OpenAPIv3 value validations. Recommended for all maps, arrays and strings. This best practice is essential for rule cost estimation (explained below). | type: array, maxItems: 100 |
| Require a date-time be more recent than a particular timestamp. | Use OpenAPIv3 string formats to declare that the field is a date-time. Use validation rules to compare it to a particular timestamp. | ... |
| Validate that the elements of two sets are disjoint. | Use x-kubernetes-list-type to validate that the arrays are sets. Use validation rules to validate the sets are disjoint. | type: object, properties: set1 (type: array, x-kubernetes-list-type: set), set2: ..., x-kubernetes-validations: - rule: "self.set1.all(e, !(e in self.set2))" |
CRD transition rules
Transition Rules make it possible to compare the new state against the old state of a resource in validation rules. You use transition rules to make sure that the cluster's API server does not accept invalid state transitions. A transition rule is a validation rule that references 'oldSelf'. The API server only evaluates transition rules when both an old value and new value exist.
Transition rule examples:
| Transition Rule | Purpose |
| --- | --- |
| self == oldSelf | For a required field, make that field immutable once it is set. For an optional field, only allow transitioning from unset to set, or from set to unset. |
| (on parent of field) has(self.field) == has(oldSelf.field); on field: self == oldSelf | Make a field immutable: validate that a field, even if optional, never changes after the resource is created (for a required field, the previous rule is simpler). |
| self.all(x, x in oldSelf) | Only allow adding items to a field that represents a set (prevent removals). |
| self >= oldSelf | Validate that a number is monotonically increasing. |
Using the Functions Libraries
Validation rules have access to a few different function libraries. Some examples:

| Validation Rule | Purpose |
| --- | --- |
| isURL(self) && url(self).getHostname() in ['a.example.com', 'b.example.com'] | Validate that a URL has an allowed hostname. |
| self.map(x, x.weight).sum() == 1 | Validate that the weights of a list of objects sum to 1. |
| int(self.find('^[0-9]*')) < 100 | Validate that a string starts with a number less than 100. |
| self.isSorted() | Validate that a list is sorted. |
Resource use and limits
To prevent CEL evaluation from consuming excessive compute resources, validation rules impose some limits. These limits are based on CEL cost units, a platform and machine independent measure of execution cost. As a result, the limits are the same regardless of where they are enforced.
Estimated cost limit
CEL is, by design, non-Turing-complete in such a way that the halting problem isn’t a concern. CEL takes advantage of this design choice to include an "estimated cost" subsystem that can statically compute the worst case run time cost of any CEL expression. Validation rules are integrated with the estimated cost system and disallow CEL expressions from being included in CRDs if they have a sufficiently poor (high) estimated cost. The estimated cost limit is set quite high and typically requires an O(n^2) or worse operation, across something of unbounded size, to be exceeded. Fortunately the fix is usually quite simple: because the cost system is aware of size limits declared in the CRD's schema, CRD authors can add size limits to the CRD's schema (maxItems for arrays, maxProperties for maps, maxLength for strings) to reduce the estimated cost.
Good practice:
Set maxItems, maxProperties and maxLength on all array, map (object with additionalProperties) and string types in CRD schemas! This results in lower and more accurate estimated costs and generally makes a CRD safer to use.
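For example, a sketch of a schema fragment with bounds declared (the fields and limits are illustrative):
properties:
  hostname:
    type: string
    maxLength: 253
  tags:
    type: array
    maxItems: 20
    items:
      type: string
      maxLength: 63
With these bounds in place, the cost estimator can assume a rule over tags iterates at most 20 strings of at most 63 characters each.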
Runtime cost limits for CRD validation rules
In addition to the estimated cost limit, CEL keeps track of actual cost while evaluating a CEL expression and will halt execution of the expression if a limit is exceeded.
With the estimated cost limit already in place, the runtime cost limit is rarely encountered. But it is possible. For example, it might be encountered for a large resource composed entirely of a single large list and a validation rule that is either evaluated on each element in the list, or traverses the entire list.
CRD authors can ensure the runtime cost limit will not be exceeded in much the same way the estimated cost limit is avoided: by setting maxItems, maxProperties and maxLength on array, map and string types.
Future work
We look forward to working with the community on the adoption of CRD Validation Rules, and hope to see this feature promoted to general availability in an upcoming Kubernetes release!
There is a growing community of Kubernetes contributors thinking about how to make it possible to write extensible admission controllers using CEL as a substitute for admission webhooks for policy enforcement use cases. Anyone interested should reach out to us on the usual SIG API Machinery channels or via slack at #sig-api-machinery-cel-dev.
Acknowledgements
Special thanks to Cici Huang, Ben Luddy, Jordan Liggitt, David Eads, Daniel Smith, Dr. Stefan Schimanski, Leila Jalali and everyone who contributed to Validation Rules!
Author: Humble Chirammal (Red Hat), Louis Koo (deeproute.ai)
Kubernetes v1.25, released earlier this month, introduced a new feature
that lets your cluster expand storage volumes, even when access to those
volumes requires a secret (for example: a credential for accessing a SAN fabric)
to perform a node expand operation. This new behavior is in alpha and you
must enable a feature gate (CSINodeExpandSecret) to make use of it.
You must also be using CSI
storage; this change isn't relevant to storage drivers that are built in to Kubernetes.
To turn on this new alpha feature, you enable the CSINodeExpandSecret feature
gate for the kube-apiserver and kubelet. Doing so turns on a mechanism that sends a secretRef
as part of the NodeExpandVolume request, so that CSI drivers can use those
credentials when performing the node-side expansion operation against the underlying
storage system.
What is this all about?
Before Kubernetes v1.24, you were able to define a cluster-level StorageClass
that made use of StorageClass Secrets,
but you didn't have any mechanism to specify the credentials to be used for
operations that take place when the storage is mounted onto a node or when
the volume has to be expanded at the node side.
Kubernetes CSI support already implemented a similar mechanism for specific kinds of
volume resizes; namely, resizes of PersistentVolumes where the resize takes place
independently of any node, referred to as controller expansion. In that case, you
associate a PersistentVolume with a Secret that contains credentials for volume resize
actions, so that controller expansion can take place. CSI also supports a nodeExpandVolume
operation, which CSI drivers can use independently of, or along with, controller
expansion, and where the resize is driven from a node in your cluster where the volume
is attached. Please read Kubernetes 1.24: Volume Expansion Now A Stable Feature to learn more.
At times, the CSI driver needs to check the actual size of the backend block storage (or image)
before proceeding with a node-level filesystem expand operation. This avoids false positive returns
from the backend storage cluster during filesystem expands.
When a PersistentVolume represents encrypted block storage (for example using LUKS)
you need to provide a passphrase in order to expand the device, and also to make it possible
to grow the filesystem on that device.
For various validations at the time of node expansion, the CSI driver has to be connected
to the backend storage cluster. If the nodeExpandVolume request includes a secretRef,
the CSI driver can use those credentials to connect to the storage cluster and
perform the required operations.
How does it work?
To enable this functionality in this version of Kubernetes, SIG Storage has introduced
a new feature gate called CSINodeExpandSecret. Once the feature gate is enabled
in the cluster, NodeExpandVolume requests can include a secretRef field. The NodeExpandVolume request
is part of CSI; for example, it appears in a request sent from the Kubernetes
control plane to the CSI driver.
As a cluster operator, you can specify these secrets as an opaque parameter in a StorageClass,
the same way that you can already specify other CSI secret data. The StorageClass needs to have some
CSI-specific parameters set. Here's an example of those parameters:
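(The secret name and namespace here are placeholders; the same keys appear in the full StorageClass manifest below.)
parameters:
  csi.storage.k8s.io/node-expand-secret-name: test-secret
  csi.storage.k8s.io/node-expand-secret-namespace: default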
If the feature gate is enabled and the StorageClass carries the above secret configuration,
the CSI provisioner receives the credentials from the Secret as part of the NodeExpansion request.
CSI volumes that require secrets for online expansion will have the nodeExpandSecretRef
field set. If it is not set, the NodeExpandVolume CSI RPC call will be made without a secret.
Trying it out
Enable the CSINodeExpandSecret feature gate (please refer to
Feature Gates).
Create a Secret, and then a StorageClass that uses that Secret.
Here's an example manifest for a Secret that holds credentials:
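This is a sketch; the data keys (username, password) are placeholders for whatever credentials your CSI driver expects:
apiVersion: v1
kind: Secret
metadata:
  name: test-secret      # referenced by the StorageClass below
  namespace: default
stringData:
  username: admin        # placeholder credential
  password: t0p-Secret   # placeholder credential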
Here's an example manifest for a StorageClass that refers to those credentials:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: csi-blockstorage-sc
parameters:
csi.storage.k8s.io/node-expand-secret-name: test-secret # the name of the Secret
csi.storage.k8s.io/node-expand-secret-namespace: default # the namespace that the Secret is in
provisioner: blockstorage.cloudprovider.example
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
Example output
If the PersistentVolumeClaim (PVC) was created successfully, you can see that
configuration within the spec.csi field of the PersistentVolume (look for
spec.csi.nodeExpandSecretRef).
Check that it worked by running kubectl get persistentvolume <pv_name> -o yaml.
You should see something like the following.
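A sketch of the relevant part (the driver name matches the example StorageClass above; the volume handle is a placeholder):
spec:
  csi:
    driver: blockstorage.cloudprovider.example
    volumeHandle: e21c7809-aabb-11ec-b909-0242ac120002   # placeholder
    nodeExpandSecretRef:
      name: test-secret
      namespace: default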
If you then trigger online storage expansion, the kubelet passes the appropriate credentials
to the CSI driver, by loading that Secret and passing the data to the storage driver.
As this feature is still in alpha, the Kubernetes Storage SIG expects to refine it based on
feedback from CSI driver authors, with more tests and implementations. The community plans to eventually
promote the feature to beta in upcoming releases.
Get involved or learn more?
The enhancement proposal includes lots of detail about the history and technical
implementation of this feature.
Please get involved by joining the Kubernetes
Storage SIG
(Special Interest Group) to help us enhance this feature.
There are a lot of good ideas already and we'd be thrilled to have more!
Local ephemeral storage capacity isolation was introduced as an alpha feature in Kubernetes 1.7 and went beta in 1.9. With Kubernetes 1.25 we are excited to announce general availability (GA) of this feature.
Pods use ephemeral local storage for scratch space, caching, and logs. The lifetime of local ephemeral storage does not extend beyond the life of the individual pod. It is exposed to pods using the container’s writable layer, logs directory, and EmptyDir volumes. Before this feature was introduced, there were issues related to the lack of local storage accounting and isolation, such as Pods not knowing how much local storage is available and being unable to request guaranteed local storage. Local storage is a best-effort resource and pods can be evicted due to other pods filling the local storage.
The local storage capacity isolation feature allows users to manage local ephemeral storage in the same way as managing CPU and memory. It provides support for capacity isolation of shared storage between pods, such that a pod can be hard limited in its consumption of shared resources by evicting Pods if its consumption of shared storage exceeds that limit. It also allows setting ephemeral storage requests for resource reservation. The limits and requests for shared ephemeral-storage are similar to those for memory and CPU consumption.
How to use local storage capacity isolation
A typical configuration for local ephemeral storage is to place all different kinds of ephemeral local data (emptyDir volumes, writeable layers, container images, logs) into one filesystem. Typically, both /var/lib/kubelet and /var/log are on the system's root filesystem. If users configure the local storage in different ways, kubelet might not be able to correctly measure disk usage and use this feature.
Setting requests and limits for local ephemeral storage
You can specify ephemeral-storage for managing local ephemeral storage. Each container of a Pod can specify either or both of the following:
spec.containers[].resources.requests.ephemeral-storage
spec.containers[].resources.limits.ephemeral-storage
In the following example, the Pod has two containers. The first container has a request of 8GiB of local ephemeral storage and a limit of 12GiB. The second container requests 2GiB of local storage but sets no limit. Therefore, the Pod requests a total of 10GiB (8GiB + 2GiB) of local ephemeral storage and enforces a limit of 12GiB of local ephemeral storage. It also sets the emptyDir sizeLimit to 5GiB. This setting in the pod spec affects how the scheduler makes scheduling decisions and how the kubelet evicts pods.
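A sketch of such a Pod manifest (the image names are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: images.example/app:v4        # placeholder image
    resources:
      requests:
        ephemeral-storage: "8Gi"
      limits:
        ephemeral-storage: "12Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: /tmp
  - name: log-aggregator
    image: images.example/log-agg:v6    # placeholder image
    resources:
      requests:
        ephemeral-storage: "2Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: /tmp
  volumes:
  - name: ephemeral
    emptyDir:
      sizeLimit: 5Gi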
First of all, the scheduler ensures that the sum of the resource requests of the scheduled containers is less than the capacity of the node. In this case, the pod can be assigned to a node only if the node's available ephemeral storage (allocatable resource) is more than 10GiB.
Secondly, at the container level, since the first container sets a resource limit, the kubelet eviction manager measures the disk usage of this container and evicts the pod if the storage usage of that container exceeds its limit (12GiB). At the pod level, the kubelet works out an overall Pod storage limit by adding up the limits of all the containers in that Pod. In this case, the total storage usage at the pod level is the sum of the disk usage from all containers plus the Pod's emptyDir volumes. If this total usage exceeds the overall Pod storage limit (12GiB), then the kubelet also marks the Pod for eviction.
Last, in this example, the emptyDir volume sets its sizeLimit to 5Gi. This means that if this pod's emptyDir uses more than 5GiB of local storage, the pod is evicted from the node.
Setting resource quota and limitRange for local ephemeral storage
This feature adds two more resource quotas for storage. The requests.ephemeral-storage and limits.ephemeral-storage quotas constrain the total ephemeral storage requests and limits, respectively, of all containers in a namespace, as shown in the sketch below.
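A sketch of such a ResourceQuota (the values are illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-storage-quota
spec:
  hard:
    requests.ephemeral-storage: 50Gi
    limits.ephemeral-storage: 100Gi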
Similar to CPU and memory, an admin can use a LimitRange to set a default container local storage request/limit, and/or minimum/maximum resource constraints for a namespace.
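A sketch of such a LimitRange (the values are illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limits
spec:
  limits:
  - type: Container
    defaultRequest:
      ephemeral-storage: 512Mi   # default request for containers that set none
    default:
      ephemeral-storage: 1Gi     # default limit for containers that set none
    min:
      ephemeral-storage: 100Mi
    max:
      ephemeral-storage: 4Gi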
Also, ephemeral-storage may be reserved for the kubelet or the system. For example: --system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=10Gi][,][pid=1000] --kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=5Gi][,][pid=1000]. If your cluster node's root disk capacity is 100Gi, then after setting these system-reserved and kube-reserved values, the available allocatable ephemeral storage becomes 85Gi. The scheduler uses this information to assign pods based on requests and the allocatable resources of each node. The eviction manager also uses allocatable resources to determine pod eviction. See more details in Reserve Compute Resources for System Daemons.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together.
We offer a huge thank you to all the contributors in Kubernetes Storage SIG and CSI community who helped review the design and implementation of the project, including but not limited to the following: