Author: Xing Yang (VMware), Ashutosh Kumar (VMware)
Kubernetes v1.24 introduced an alpha quality implementation of improvements
for handling a non-graceful node shutdown.
In Kubernetes v1.26, this feature moves to beta. It allows stateful workloads to fail over to a different node after the original node is shut down or in a non-recoverable state, such as a hardware failure or a broken OS.
What is a node shutdown in Kubernetes?
In a Kubernetes cluster, it is possible for a node to shut down. This could happen either in a planned way or unexpectedly. You may plan a security patch or a kernel upgrade and need to reboot the node, or the node may shut down due to preemption of VM instances. A node may also shut down due to a hardware failure or a software problem.
To trigger a node shutdown, you could run a shutdown or poweroff command in a shell,
or physically press a button to power off a machine.
A node shutdown could lead to workload failure if the node is not drained before the shutdown.
In the following, we will describe what a graceful node shutdown is and what a non-graceful node shutdown is.
What is a graceful node shutdown?
The kubelet's handling of a graceful node shutdown
allows the kubelet to detect a node shutdown event, properly terminate the pods on that node,
and release resources before the actual shutdown.
Critical pods
are terminated after all the regular pods are terminated, to ensure that the
essential functions of an application can continue to work as long as possible.
What is a non-graceful node shutdown?
A Node shutdown can be graceful only if the kubelet's node shutdown manager can
detect the upcoming node shutdown action. However, there are cases where a kubelet
does not detect a node shutdown action. This could happen because the shutdown
command does not trigger the Inhibitor Locks mechanism used by the kubelet on Linux, or because of a user error. For example, if
the shutdownGracePeriod and shutdownGracePeriodCriticalPods details are not
configured correctly for that node.
When a node is shut down (or crashes), and that shutdown was not detected by the kubelet
node shutdown manager, it becomes a non-graceful node shutdown. Non-graceful node shutdown
is a problem for stateful apps.
If a node containing a pod that is part of a StatefulSet is shut down in a non-graceful way, the Pod
will be stuck in Terminating status indefinitely, and the control plane cannot create a replacement
Pod for that StatefulSet on a healthy node.
You can delete the failed Pods manually, but this is not ideal for a self-healing cluster.
Similarly, Pods that a ReplicaSet created as part of a Deployment, and
that were bound to the now-shutdown node, stay in Terminating status indefinitely.
If you have set a horizontal scaling limit, even those terminating Pods count against the limit,
so your workload may struggle to self-heal if it was already at maximum scale.
(By the way: if the node that had done a non-graceful shutdown comes back up, the kubelet does delete
the old Pod, and the control plane can make a replacement.)
What's new for the beta?
For Kubernetes v1.26, the non-graceful node shutdown feature is beta and enabled by default.
The NodeOutOfServiceVolumeDetach feature gate is enabled by default
on kube-controller-manager instead of being opt-in; you can still disable it if needed
(please also file an issue to explain the problem).
On the instrumentation side, the kube-controller-manager reports two new metrics.
force_delete_pods_total: the number of pods that are being forcibly deleted (resets on Pod garbage collection controller restart)
force_delete_pod_errors_total: the number of errors encountered when attempting forcible Pod deletion (also resets on Pod garbage collection controller restart)
How does it work?
In the case of a node shutdown, if a graceful shutdown is not working or the node is in a
non-recoverable state due to hardware failure or broken OS, you can manually add an out-of-service
taint on the Node. For example, this can be node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
or node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule. This taint triggers pods on the node to
be forcefully deleted if there are no matching tolerations on the pods. Persistent volumes attached to the shutdown node will be detached, and new pods will be created successfully on a different running node.
Note: Before applying the out-of-service taint, you must verify that a node is already in shutdown
or power-off state (not in the middle of restarting), either because the user intentionally shut it down
or the node is down due to hardware failures, OS issues, etc.
Once all the workload pods that are linked to the out-of-service node have moved to a new running node, and the shutdown node has been recovered, you should remove that taint from the affected node.
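For example, assuming a failed node named node-1 (the node name here is just a placeholder), the taint could be applied and later removed with kubectl:

kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

and, once the workloads have failed over and the node has been recovered:

kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-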
What’s next?
Depending on feedback and adoption, the Kubernetes team plans to push the Non-Graceful Node Shutdown implementation to GA in either 1.27 or 1.28.
This feature requires a user to manually add a taint to the node to trigger the failover of workloads and remove the taint after the node is recovered.
The cluster operator can automate this process by automatically applying the out-of-service taint
if there is a programmatic way to determine that the node is really shut down and there isn’t IO between
the node and storage. The cluster operator can then automatically remove the taint after the workload
fails over successfully to another running node and the shutdown node has been recovered.
In the future, we plan to find ways to automatically detect and fence nodes that are shut down or in a non-recoverable state and fail their workloads over to another node.
There are many people who have helped review the design and implementation along the way. We want to thank everyone who has contributed to this effort, including the roughly 30 people who have reviewed the KEP and implementation over the last couple of years.
This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.
Authors: Patrick Ohly (Intel), Kevin Klues (NVIDIA)
Dynamic resource allocation is a new API for requesting resources. It is a
generalization of the persistent volumes API for generic resources, making it possible to:
access the same resource instance in different pods and containers,
attach arbitrary constraints to a resource request to get the exact resource
you are looking for,
initialize a resource according to parameters provided by the user.
Third-party resource drivers are responsible for interpreting these parameters
as well as tracking and allocating resources as requests come in.
Dynamic resource allocation is an alpha feature and only enabled when the
DynamicResourceAllocation feature gate and the
resource.k8s.io/v1alpha1 API group are enabled. For details, see the
--feature-gates and --runtime-config kube-apiserver
parameters.
The kube-scheduler, kube-controller-manager and kubelet components all need
the feature gate enabled as well.
The default configuration of kube-scheduler enables the DynamicResources
plugin if and only if the feature gate is enabled. Custom configurations may
have to be modified to include it.
Once dynamic resource allocation is enabled, resource drivers can be installed
to manage certain kinds of hardware. Kubernetes has a test driver that is used
for end-to-end testing, but also can be run manually. See
below for step-by-step instructions.
API
The new resource.k8s.io/v1alpha1 API group provides four new types:
ResourceClass
Defines which resource driver handles a certain kind of
resource and provides common parameters for it. ResourceClasses
are created by a cluster administrator when installing a resource
driver.
ResourceClaim
Defines a particular resource instance that is required by a
workload. Created by a user (lifecycle managed manually, can be shared
between different Pods) or for individual Pods by the control plane based on
a ResourceClaimTemplate (automatic lifecycle, typically used by just one
Pod).
ResourceClaimTemplate
Defines the spec and some metadata for creating
ResourceClaims. Created by a user when deploying a workload.
PodScheduling
Used internally by the control plane and resource drivers
to coordinate pod scheduling when ResourceClaims need to be allocated
for a Pod.
Parameters for ResourceClass and ResourceClaim are stored in separate objects,
typically using the type defined by a CRD that was created when
installing a resource driver.
With this alpha feature enabled, the spec of a Pod defines the ResourceClaims that are needed for a Pod
to run: this information goes into a new
resourceClaims field. Entries in that list reference either a ResourceClaim
or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using
this .spec (for example, inside a Deployment or StatefulSet) share the same
ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets
its own ResourceClaim instance.
For a container defined within a Pod, the resources.claims list
defines whether that container gets
access to these resource instances, which makes it possible to share resources
between one or more containers inside the same Pod. For example, an init container could
set up the resource before the application uses it.
Here is an example of a fictional resource driver. Two ResourceClaim objects
will get created for this Pod and each container gets access to one of them.
Assuming a resource driver called resource-driver.example.com was installed
together with the following resource class:
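(what follows is a sketch: the class, template, claim, and image names are made up, while the field names follow the v1alpha1 schema)

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: example-class
driverName: resource-driver.example.com

a Pod could then request two resources of that class through a ResourceClaimTemplate:

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: example-template
spec:
  spec:
    resourceClassName: example-class
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-claims
spec:
  resourceClaims:
  - name: resource-0
    source:
      resourceClaimTemplateName: example-template
  - name: resource-1
    source:
      resourceClaimTemplateName: example-template
  containers:
  - name: first
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: resource-0
  - name: second
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: resource-1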
In contrast to native resources (such as CPU or RAM) and
extended resources
(managed by a
device plugin, advertised by kubelet), the scheduler has no knowledge of what
dynamic resources are available in a cluster or how they could be split up to
satisfy the requirements of a specific ResourceClaim. Resource drivers are
responsible for that. Drivers mark ResourceClaims as allocated once the resources
for them are reserved. This also then tells the scheduler where in the cluster a
claimed resource is actually available.
ResourceClaims can get resources allocated as soon as the ResourceClaim
is created (immediate allocation), without considering which Pods will use
the resource. The default (wait for first consumer) is to delay allocation until
a Pod that relies on the ResourceClaim becomes eligible for scheduling.
This design with two allocation options is similar to how Kubernetes handles
storage provisioning with PersistentVolumes and PersistentVolumeClaims.
In the wait for first consumer mode, the scheduler checks all ResourceClaims needed
by a Pod. If the Pod has any ResourceClaims, the scheduler creates a PodScheduling object
(a special object that requests scheduling details on behalf of the Pod). The PodScheduling object
has the same name and namespace as the Pod, and the Pod as its owner.
Using its PodScheduling, the scheduler informs the resource drivers
responsible for those ResourceClaims about nodes that the scheduler considers
suitable for the Pod. The resource drivers respond by excluding nodes that
don't have enough of the driver's resources left.
Once the scheduler has that resource
information, it selects one node and stores that choice in the PodScheduling
object. The resource drivers then allocate resources based on the relevant
ResourceClaims so that the resources will be available on that selected node.
Once that resource allocation is complete, the scheduler attempts to schedule the Pod
to a suitable node. Scheduling can still fail at this point; for example, a different Pod could
be scheduled to the same node in the meantime. If this happens, already allocated
ResourceClaims may get deallocated to enable scheduling onto a different node.
As part of this process, ResourceClaims also get reserved for the
Pod. Currently, ResourceClaims can be used either exclusively by a single Pod or
by an unlimited number of Pods.
One key feature is that Pods do not get scheduled to a node unless all of
their resources are allocated and reserved. This avoids the scenario where
a Pod gets scheduled onto one node and then cannot run there, which is bad
because such a pending Pod also blocks all other resources like RAM or CPU that were
set aside for it.
Limitations
The scheduler plugin must be involved in scheduling Pods which use
ResourceClaims. Bypassing the scheduler by setting the nodeName field leads
to Pods that the kubelet refuses to start because the ResourceClaims are not
reserved or not even allocated. It may be possible to remove this
limitation in the
future.
Writing a resource driver
A dynamic resource allocation driver typically consists of two separate-but-coordinating
components: a centralized controller, and a DaemonSet of node-local kubelet
plugins. Most of the work required by the centralized controller to coordinate
with the scheduler can be handled by boilerplate code. Only the business logic
required to actually allocate ResourceClaims against the ResourceClasses owned
by the plugin needs to be customized. As such, Kubernetes provides
the following package, including APIs for invoking this boilerplate code as
well as a Driver interface that you can implement to provide your custom
business logic:
Likewise, boilerplate code can be used to register the node-local plugin with
the kubelet, as well as start a gRPC server to implement the kubelet plugin
API. For drivers written in Go, the following package is recommended:
It is up to the driver developer to decide how these two components
communicate. The KEP outlines an approach using
CRDs.
Within SIG Node, we also plan to provide a complete example
driver that can serve
as a template for other drivers.
Running the test driver
The following steps bring up a local, one-node cluster directly from the
Kubernetes source code. As a prerequisite, your cluster must have nodes with a container
runtime that supports the
Container Device Interface
(CDI). For example, you can run CRI-O v1.23.2 or later.
Once containerd v1.7.0 is released, we expect that you can run that or any later version.
In the example below, we use CRI-O.
First, clone the Kubernetes source code. Inside that directory, run:
$ hack/install-etcd.sh
...
$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
FEATURE_GATES=DynamicResourceAllocation=true \
DNS_ADDON="coredns" \
CGROUP_DRIVER=systemd \
CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
LOG_LEVEL=6 \
ENABLE_CSI_SNAPSHOTTER=false \
API_SECURE_PORT=6444 \
ALLOW_PRIVILEGED=1 \
PATH=$(pwd)/third_party/etcd:$PATH \
./hack/local-up-cluster.sh -O
...
To start using your cluster, you can open up another terminal/tab and run:
export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
...
Once the cluster is up, in another
terminal run the test driver controller. KUBECONFIG must be set for all of
the following commands.
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller
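In another terminal, run the kubelet plugin of the test driver. Here is a sketch of that step; the subcommand name and the directory paths below are taken from the test driver's setup for a local-up-cluster environment and may differ:

$ sudo mkdir -p /var/run/cdi
$ sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 kubelet-plugin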
Changing the permissions of the directories makes it possible to run and (when
using delve) debug the kubelet plugin as a normal user, which is convenient
because it uses the already populated Go cache. Remember to restore permissions
with sudo chmod go-w when done. Alternatively, you can also build the binary
and run that as root.
Now the cluster is ready to create objects:
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created
$ kubectl get resourceclaims
NAME RESOURCECLASSNAME ALLOCATIONMODE STATE AGE
test-inline-claim-resource example WaitForFirstConsumer allocated,reserved 8s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-inline-claim 0/2 Completed 0 21s
The test driver doesn't do much; it only sets environment variables as defined
in the ConfigMap. The test pod dumps the environment, so you can check the log
to verify that everything worked:
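For example, a hypothetical check (the pod name comes from the listing above; --all-containers avoids having to know the individual container names):

$ kubectl logs test-inline-claim --all-containers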
You can view or comment on the project board
for dynamic resource allocation.
In order to move this feature towards beta, we need feedback from hardware
vendors, so here's a call to action: try out this feature, consider how it can help
with problems that your users are having, and write resource drivers…
Authors: Brandon Smith (Microsoft) and Mark Rossetti (Microsoft)
The long-awaited day has arrived: HostProcess containers, the Windows equivalent to Linux privileged
containers, have finally made it to GA in Kubernetes 1.26!
What are HostProcess containers and why are they useful?
Cluster operators are often faced with the need to configure their nodes upon provisioning such as
installing Windows services, configuring registry keys, managing TLS certificates,
making network configuration changes, or even deploying monitoring tools such as Prometheus's node-exporter.
Previously, performing these actions on Windows nodes was usually done by running PowerShell scripts
over SSH or WinRM sessions and/or working with your cloud provider's virtual machine management tooling.
HostProcess containers now enable you to do all of this and more with minimal effort using Kubernetes native APIs.
With HostProcess containers you can now package any payload
into the container image, map volumes into containers at runtime, and manage them like any other Kubernetes workload.
You get all the benefits of containerized packaging and deployment methods combined with a reduction in
both administrative and development cost.
Gone are the days where cluster operators would need to manually log onto
Windows nodes to perform administrative duties.
HostProcess containers differ
quite significantly from regular Windows Server containers.
They are run directly as processes on the host with the access policies of
a user you specify. HostProcess containers run as either the built-in Windows system accounts or
ephemeral users within a user group defined by you. HostProcess containers also share
the host's network namespace and access/configure storage mounts visible to the host.
On the other hand, Windows Server containers are highly isolated and exist in a separate
execution namespace. Direct access to the host from a Windows Server container is explicitly disallowed
by default.
How does it work?
Windows HostProcess containers are implemented with Windows Job Objects,
a break from the previous container model, which uses server silos.
Job Objects are components of the Windows OS which offer the ability to
manage a group of processes as a unit (also known as a job) and assign resource constraints to the
group as a whole. Job objects are specific to the Windows OS and are not associated with
the Kubernetes Job API. They have no process
or file system isolation,
enabling the privileged payload to view and edit the host file system with the
desired permissions, among other host resources. The init process, and any processes
it launches (including processes explicitly launched by the user) are all assigned to the
job object of that container. When the init process exits or is signaled to exit,
all the processes in the job will be signaled to exit, the job handle will be
closed and the storage will be unmounted.
HostProcess and Linux privileged containers enable similar scenarios but differ
greatly in their implementation (hence the naming difference). HostProcess containers
have their own PodSecurityContext fields.
Those used to configure Linux privileged containers do not apply. Enabling privileged access to a Windows host is a
fundamentally different process than with Linux so the configuration and
capabilities of each differ significantly. Below is a diagram detailing the
overall architecture of Windows HostProcess containers:
Two major features were added prior to moving to stable: the ability to run as local user accounts, and
a simplified method of accessing volume mounts. To learn more, read
Create a Windows HostProcess Pod.
HostProcess containers in action
Kubernetes SIG Windows has been busy putting HostProcess containers to use - even before GA!
They've been very excited to use HostProcess containers for a number of important activities
that were a pain to perform in the past.
Here are just a few of the many use cases with example deployments:
A HostProcess container can be built using any base image of your choosing; however, for convenience we have
created a HostProcess container base image.
This image is only a few KB in size and does not inherit any of the compatibility requirements of regular Windows
Server containers, which allows it to run on any Windows Server version.
To use that Microsoft image, put this in your Dockerfile:
FROM mcr.microsoft.com/oss/kubernetes/windows-host-process-containers-base-image:v1.0.0
You can run HostProcess containers from within a
HostProcess Pod.
To get started with running Windows containers,
see the general guidance for deploying Windows nodes.
If you have a compatible node (for example: Windows as the operating system
with containerd v1.7 or later as the container runtime), you can deploy a Pod with one
or more HostProcess containers.
See the Create a Windows HostProcess Pod - Prerequisites
for more information.
Please note that within a Pod, you can't mix HostProcess containers with normal Windows containers.
The Kubernetes Special Interest Group (SIG) Release is proud to announce that we
are digitally signing all release artifacts, and that this aspect of Kubernetes
has now reached beta.
Signing artifacts provides end users a chance to verify the integrity of the
downloaded resource. It makes it possible to mitigate man-in-the-middle attacks directly on
the client side and therefore ensures the trustworthiness of the remote location serving the
artifacts. The overall goal of our past work was to define the tooling used for
signing all Kubernetes-related artifacts, as well as to provide a standard signing
process for related projects (for example for those in kubernetes-sigs).
We already signed all officially released container images (from Kubernetes v1.24 onwards).
Image signing was alpha for v1.24 and v1.25. For v1.26, we've added all
binary artifacts to the signing process as well! This means that now all
client, server and source tarballs, binary artifacts,
Software Bills of Material (SBOMs) as well as the build
provenance will be signed using cosign. Technically
speaking, we now ship additional *.sig (signature) and *.cert (certificate)
files side by side to the artifacts for verifying their integrity.
To verify an artifact, for example kubectl, you can download the
signature and certificate alongside the binary. I use the release candidate
rc.1 of v1.26 for demonstration purposes, because the final release was not yet available at the time of writing:
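Here is a sketch of that verification, assuming cosign v1.x with keyless verification enabled via COSIGN_EXPERIMENTAL (the URLs follow the dl.k8s.io release layout):

curl -sSfL https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl -o kubectl
curl -sSfL https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.sig -o kubectl.sig
curl -sSfL https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.cert -o kubectl.cert
COSIGN_EXPERIMENTAL=1 cosign verify-blob kubectl --signature kubectl.sig --certificate kubectl.cert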
The HashedRekordObj.signature.content should match the content of the file
kubectl.sig and HashedRekordObj.signature.publicKey.content should be
identical to the contents of kubectl.cert. It is also possible to specify
the remote certificate and signature locations without downloading them:
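For example (same assumptions as above):

COSIGN_EXPERIMENTAL=1 cosign verify-blob kubectl \
    --signature https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.sig \
    --certificate https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.cert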
tlog entry verified with uuid: 5d54b39222e3fa9a21bcb0badd8aac939b4b0d1d9085b37f1f10b18a8cd24657 index: 8173886
Verified OK
All of the mentioned steps as well as how to verify container images are
outlined in the official documentation about how to Verify Signed Kubernetes
Artifacts. In one of the upcoming Kubernetes releases, we will work on
making the overall story more mature by ensuring that truly all
Kubernetes artifacts are signed. Besides that, we are considering using Kubernetes-owned
infrastructure for the signing (root trust) and verification (transparency
log) process.
Getting involved
If you're interested in contributing to SIG Release, then consider applying for
the upcoming v1.27 shadowing program (watch for the announcement on
k-dev) or join our weekly meeting to say hi.
Thank you for reading this blog post! I'd like to use this opportunity to give
all involved SIG Release folks a special shout-out for shipping this feature in
time!
It's with immense joy that we announce the release of Kubernetes v1.26!
This release includes a total of 37 enhancements: eleven of them are graduating to Stable, ten are
graduating to Beta, and sixteen of them are entering Alpha. We also have twelve features being
deprecated or removed, three of which we detail further in this announcement.
Release theme and logo
Kubernetes 1.26: Electrifying
The theme for Kubernetes v1.26 is Electrifying.
Each Kubernetes release is the result of the coordinated effort of dedicated volunteers, and only
made possible due to the use of a diverse and complex set of computing resources, spread out through
multiple datacenters and regions worldwide. The end result of a release - the binaries, the container
images, the documentation - is then deployed on a growing number of personal, on-premises, and
cloud computing resources.
In this release we want to recognise the importance of all these building blocks on which Kubernetes
is developed and used, while at the same time raising awareness on the importance of taking the
energy consumption footprint into account: environmental sustainability is an inescapable concern of
creators and users of any software solution, and the environmental footprint of software, like
Kubernetes, is an area which we believe will play a significant role in future releases.
As a community, we always work to make each new release process better than before (in this release,
we have started to use Projects for tracking
enhancements, for example). If v1.24
"Stargazer"had us looking upwards, to
what is possible when our community comes together, and v1.25
"Combiner"what the combined efforts of our community
are capable of, this v1.26 "Electrifying" is also dedicated to all of those whose individual
motion, integrated into the release flow, made all of this possible.
Major themes
Kubernetes v1.26 is composed of many changes, brought to you by a worldwide team of volunteers. For
this release, we have identified several major themes.
Change in container image registry
In the previous release, Kubernetes changed the container
registry, allowing the spread of the load
across multiple Cloud Providers and Regions, a change that reduced the reliance on a single entity
and provided a faster download experience for a large number of users.
This release of Kubernetes is the first that is exclusively published in the new registry.k8s.io
container image registry. In the (now legacy) k8s.gcr.io image registry, no container image tags
for v1.26 will be published, and only tags from releases before v1.26 will continue to be
updated. Refer to registry.k8s.io: faster, cheaper and Generally
Available for more information on the
motivation, advantages, and implications of this significant change.
CRI v1alpha2 removed
With the adoption of the Container Runtime Interface (CRI) and
the removal of dockershim in v1.24, the CRI is the only
supported and documented way through which Kubernetes interacts with different container
runtimes. Each kubelet negotiates which version of CRI to use with the container runtime on that
node.
In the previous release, the Kubernetes project recommended using CRI version v1, but kubelet
could still negotiate the use of CRI v1alpha2, which was deprecated.
Kubernetes v1.26 drops support for CRI v1alpha2. That
removal will result in the kubelet not
registering the node if the container runtime doesn't support CRI v1. This means that containerd
minor version 1.5 and older are not supported in Kubernetes 1.26; if you use containerd, you will
need to upgrade to containerd version 1.6.0 or later before you upgrade that node to Kubernetes
v1.26. This applies equally to any other container runtimes that only support CRI v1alpha2: if
that affects you, you should contact the container runtime vendor for advice or check their website
for additional instructions on how to move forward.
Storage improvements
Following the GA of the core Container Storage Interface (CSI)
Migration
feature in the previous release, CSI migration is an on-going effort that we've been working on for
a few releases now, and this release continues to add (and remove) features aligned with the
migration's goals, as well as other improvements to Kubernetes storage.
CSI migration for Azure File and vSphere graduated to stable
Delegate FSGroup to CSI Driver graduated to stable
This feature allows Kubernetes to supply the pod's fsGroup to the CSI driver when a volume is
mounted so that the driver can utilize
mount options to control volume permissions. Previously, the kubelet would always apply the
fsGroup ownership and permission change to files in the volume according to the policy specified in
the Pod's .spec.securityContext.fsGroupChangePolicy field. Starting with this release, CSI
drivers have the option to apply the fsGroup settings during attach or mount time of the volumes.
In-tree GlusterFS driver removal
Already deprecated in the v1.25 release, the in-tree GlusterFS driver was
removed in this release.
Signing Kubernetes release artifacts graduates to beta
Introduced in Kubernetes v1.24, this
feature constitutes a significant milestone
in improving the security of the Kubernetes release process. All release artifacts are signed
keyless using cosign, and both binary artifacts and images
can be verified.
Support for Windows privileged containers graduates to stable
Privileged container support allows containers to run with similar access to the host as processes
that run on the host directly. Support for this feature in Windows nodes, called HostProcess
containers, will now graduate to Stable,
enabling access to host resources (including network resources) from privileged containers.
Improvements to Kubernetes metrics
This release has several noteworthy improvements on metrics.
Component Health Service Level Indicators graduates to alpha
Also improving on the ability to consume Kubernetes metrics, component health Service Level
Indicators (SLIs) have graduated to
Alpha: by enabling the ComponentSLIs
feature gate, an additional metrics endpoint becomes available which allows the calculation of Service
Level Objectives (SLOs) from raw healthcheck data converted into metric format.
Feature metrics are now available
Feature metrics are now available for each Kubernetes component, making it possible to track
whether each active feature gate is enabled
by checking the component's metric endpoint for kubernetes_feature_enabled.
Dynamic Resource Allocation graduates to alpha
Dynamic Resource
Allocation
is a new feature
that puts resource scheduling in the hands of third-party developers: it offers an
alternative to the limited "countable" interface for requesting access to resources
(e.g. nvidia.com/gpu: 2), providing an API more akin to that of persistent volumes. Under the
hood, it uses the Container Device
Interface (CDI) to do
its device injection. This feature is gated behind the DynamicResourceAllocation feature gate.
CEL in Admission Control graduates to alpha
This feature introduces a v1alpha1 API for validating admission
policies, enabling extensible admission
control via Common Expression Language expressions. Currently,
custom policies are enforced via admission
webhooks,
which, while flexible, have a few drawbacks when compared to in-process policy enforcement. To use,
enable the ValidatingAdmissionPolicy feature gate and the admissionregistration.k8s.io/v1alpha1
API via --runtime-config.
Pod scheduling improvements
Kubernetes v1.26 introduces some relevant enhancements to the ability to better control scheduling
behavior.
NodeInclusionPolicyInPodTopologySpread graduates to beta
By specifying a nodeInclusionPolicy in topologySpreadConstraints, you can control whether to
take taints/tolerations into consideration
when calculating Pod Topology Spread skew.
Other Updates
Graduations to stable
This release includes a total of eleven enhancements promoted to Stable.
The complete details of the Kubernetes v1.26 release are available in our release
notes.
Availability
Kubernetes v1.26 is available for download on the Kubernetes site.
To get started with Kubernetes, check out these interactive tutorials or run local
Kubernetes clusters using containers as "nodes", with kind. You can also
easily install v1.26 using kubeadm.
Release team
Kubernetes is only possible with the support, commitment, and hard work of its community. Each
release team is made up of dedicated community volunteers who work together to build the many pieces
that make up the Kubernetes releases you rely on. This requires the specialized skills of people
from all corners of our community, from the code itself to its documentation and project management.
We would like to thank the entire release team
for the hours spent hard at work to ensure we deliver a solid Kubernetes v1.26 release for our community.
A very special thanks is in order for our Release Lead, Leonard Pahlke, for successfully steering
the entire release team throughout the entire release cycle, by making sure that we could all
contribute in the best way possible to this release through his constant support and attention to
the many and diverse details that make up the path to a successful release.
KubeCon + CloudNativeCon Europe 2023 will take place in Amsterdam, The Netherlands, from 17 – 21
April 2023! You can find more information about the conference and registration on the event
site.
CloudNativeSecurityCon North America, a two-day event designed to foster collaboration, discussion
and knowledge sharing of cloud native security projects and how to best use these to address
security challenges and opportunities, will take place in Seattle, Washington (USA), from 1-2
February 2023. See the event
page for more
information.
The CNCF announced the 2022 Community Awards
Winners:
the Community Awards recognize CNCF community members that are going above and beyond to advance
cloud native technology.
Project velocity
The CNCF K8s DevStats project
aggregates a number of interesting data points related to the velocity of Kubernetes and various
sub-projects. This includes everything from individual contributions to the number of companies that
are contributing, and is an illustration of the depth and breadth of effort that goes into evolving
this ecosystem.
Join members of the Kubernetes v1.26 release team on Tuesday January 17, 2023 10am - 11am EST (3pm - 4pm UTC) to learn about the major features
of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event
page.
Get Involved
The simplest way to get involved with Kubernetes is by joining one of the many Special Interest
Groups (SIGs) that align with your
interests.
Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly
community meeting, and through
the channels below:
Forensic container checkpointing is based on Checkpoint/Restore In
Userspace (CRIU) and allows the creation of stateful copies
of a running container without the container knowing that it is being
checkpointed. The copy of the container can be analyzed and restored in a
sandbox environment multiple times without the original container being aware
of it. Forensic container checkpointing was introduced as an alpha feature in
Kubernetes v1.25.
How does it work?
With the help of CRIU it is possible to checkpoint and restore containers.
CRIU is integrated in runc, crun, CRI-O and containerd and forensic container
checkpointing as implemented in Kubernetes uses these existing CRIU
integrations.
Why is it important?
With the help of CRIU and the corresponding integrations it is possible to get
all information and state about a running container on disk for later forensic
analysis. Forensic analysis might be important to inspect a suspicious
container without stopping or influencing it. If the container is really under
attack, the attacker might detect attempts to inspect the container. Taking a
checkpoint and analysing the container in a sandboxed environment offers the
possibility to inspect the container without the original container and maybe
attacker being aware of the inspection.
In addition to the forensic container checkpointing use case, it is also
possible to migrate a container from one node to another node without losing
the internal state. Especially for stateful containers with long initialization
times, restoring from a checkpoint might save time after a reboot or enable much
faster startup times.
How do I use container checkpointing?
The feature is behind a feature gate, so
make sure to enable the ContainerCheckpoint feature gate before you try to use
it.
The runtime must also support container checkpointing:
containerd: support is currently under discussion. See containerd
pull request #6965 for more details.
CRI-O: v1.25 has support for forensic container checkpointing.
Usage example with CRI-O
To use forensic container checkpointing in combination with CRI-O, the runtime
needs to be started with the command-line option --enable-criu-support=true.
For Kubernetes, you need to run your cluster with the ContainerCheckpoint
feature gate enabled. As the checkpointing functionality is provided by CRIU it
is also necessary to install CRIU. Usually runc or crun depend on CRIU and
therefore it is installed automatically.
It is also important to mention that, at the time of writing, the checkpointing functionality is
to be considered an alpha-level feature in CRI-O and Kubernetes, and the
security implications are still under consideration.
Once containers and pods are running it is possible to create a checkpoint.
Checkpointing
is currently only exposed on the kubelet level. To checkpoint a container,
you can run curl on the node where that container is running, and trigger a
checkpoint:
curl -X POST "https://localhost:10250/checkpoint/namespace/podId/container"
For a container named counter in a pod named counters in a namespace named
default the kubelet API endpoint is reachable at:
curl -X POST "https://localhost:10250/checkpoint/default/counters/counter"
For completeness, the following curl command-line options are necessary to
have curl accept the kubelet's self-signed certificate and authorize the
use of the kubelet checkpoint API:
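For a cluster brought up with local-up-cluster.sh, for example, those options could look like this (the certificate paths are an assumption based on that setup):

--insecure --cert /var/run/kubernetes/client-admin.crt --key /var/run/kubernetes/client-admin.key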
Triggering this kubelet API will request the creation of a checkpoint from
CRI-O. CRI-O requests a checkpoint from your low-level runtime (for example,
runc). Seeing that request, runc invokes the criu tool
to do the actual checkpointing.
Once the checkpointing has finished the checkpoint should be available at
/var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar
You could then use that tar archive to restore the container somewhere else.
Restore a checkpointed container outside of Kubernetes (with CRI-O)
With the checkpoint tar archive it is possible to restore the container outside
of Kubernetes in a sandboxed instance of CRI-O. For better user experience
during restore, I recommend that you use the latest version of CRI-O from the
main CRI-O GitHub branch. If you're using CRI-O v1.25, you'll need to
manually create certain directories Kubernetes would create before starting the
container.
The first step to restore a container outside of Kubernetes is to create a pod sandbox
using crictl:
crictl runp pod-config.json
Then you can restore the previously checkpointed container into the newly created pod sandbox:
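With crictl, that could look like this (the pod ID comes from the output of crictl runp above):

crictl create <POD_ID> container-config.json pod-config.json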
Instead of specifying a container image in a registry in container-config.json,
you need to specify the path to the checkpoint archive that you created earlier:
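A minimal sketch of such a container-config.json (the container name and the archive file name are placeholders):

{
  "metadata": {
    "name": "counter"
  },
  "image": {
    "image": "/var/lib/kubelet/checkpoints/<checkpoint-archive>.tar"
  }
}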
Next, run crictl start <CONTAINER_ID> to start that container, and then a
copy of the previously checkpointed container should be running.
Restore a checkpointed container within Kubernetes
To restore the previously checkpointed container directly in Kubernetes it is
necessary to convert the checkpoint archive into an image that can be pushed to
a registry.
One possible way to convert the local checkpoint archive consists of the
following steps with the help of buildah:
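Here is a sketch of those steps; the annotation key is the one CRI-O's checkpoint support looks for, and the archive name placeholders must match your checkpoint:

newcontainer=$(buildah from scratch)
buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar /
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container-name> $newcontainer
buildah commit $newcontainer checkpoint-image:latest
buildah rm $newcontainer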
The resulting image is not standardized and only works in combination with
CRI-O. Please consider this image format as pre-alpha. There are ongoing
discussions to standardize the format of checkpoint
images like this. It is important to remember that this not-yet-standardized image
format only works if CRI-O has been started with --enable-criu-support=true.
The security implications of starting CRI-O with CRIU support are not yet clear
and therefore the functionality as well as the image format should be used with
care.
Now, you'll need to push that image to a container image registry. For example:
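A sketch with buildah, using the registry placeholder from the next paragraph:

buildah push localhost/checkpoint-image:latest container-image-registry.example/user/checkpoint-image:latest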
To restore this checkpoint image (container-image-registry.example/user/checkpoint-image:latest), the
image needs to be listed in the specification for a Pod. Here's an example
manifest:
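The following is a sketch of such a manifest (the Pod and container names are made up):

apiVersion: v1
kind: Pod
metadata:
  name: example-restore
spec:
  containers:
  - name: counter
    image: container-image-registry.example/user/checkpoint-image:latest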
Kubernetes schedules the new Pod onto a node. The kubelet on that node
instructs the container runtime (CRI-O in this example) to create and start a
container based on the image specified as container-image-registry.example/user/checkpoint-image:latest.
CRI-O detects that this image
is a reference to checkpoint data rather than a container image. Then,
instead of the usual steps to create and start a container,
CRI-O fetches the checkpoint data and restores the container from that
specified checkpoint.
The application in that Pod would continue running as if the checkpoint had not been taken;
within the container, the application looks and behaves like any other container that had been
started normally and not restored from a checkpoint.
With these steps, it is possible to replace a Pod running on one node
with a new equivalent Pod that is running on a different node,
and without losing the state of the containers in that Pod.
Debugging software in production is one of the biggest challenges we have to
face in our containerized environments. Being able to understand the impact of
the available security options, especially when it comes to configuring our
deployments, is one of the key aspects to make the default security in
Kubernetes stronger. We have all those logging, tracing and metrics data already
at hand, but how do we assemble the information they provide into something
human readable and actionable?
Seccomp is one of the standard mechanisms to protect a Linux based
Kubernetes application from malicious actions by interfering with its system
calls. This allows us to restrict the application to a defined set of
actionable items, like modifying files or responding to HTTP requests. Linking
the knowledge of which set of syscalls is required to, for example, modify a
local file, to the actual source code is similarly non-trivial. Seccomp
profiles for Kubernetes have to be written in JSON and can be understood
as an architecture specific allow-list with superpowers, for example:
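The following is a minimal sketch of such a profile; the syscall list is purely illustrative:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["open", "openat", "read", "write", "close", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}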
The above profile errors by default by specifying SCMP_ACT_ERRNO as the
defaultAction. This means we have to allow a set of syscalls via
SCMP_ACT_ALLOW; otherwise the application would not be able to do anything at
all. Okay cool, so to allow file operations, all we have to do is
add a bunch of file-specific syscalls like open or write, and probably
also allow changing permissions via chmod and chown, right?
Basically yes, but there are issues with the simplicity of that approach:
Seccomp profiles need to include the minimum set of syscalls required to start
the application. This also includes some syscalls from the lower level
Open Container Initiative (OCI) container runtime, for example
runc or crun. Besides that, we can only guarantee the required
syscalls for a very specific version of the runtimes and our application,
because the code parts can change between releases. The same applies to the
termination of the application as well as the target architecture we're
deploying on. Features like executing commands within containers also require
another subset of syscalls. Not to mention that there are multiple versions of
syscalls doing slightly different things, and that seccomp profiles are able to
modify their arguments. It's also not always clearly visible to developers
which syscalls are used by the code they have written themselves, because they rely on
programming language abstractions or frameworks.
How can we know which syscalls are even required then? Who should create and
maintain those profiles during its development life-cycle?
Well, recording and distributing seccomp profiles is one of the problem domains
of the Security Profiles Operator, which is already solving that. The
operator is able to record seccomp, SELinux and even
AppArmor profiles into a Custom Resource Definition (CRD),
reconciles them to each node and makes them available for usage.
The biggest challenge about creating security profiles is to catch all code
paths which execute syscalls. We could achieve that by having 100% logical
coverage of the application when running an end-to-end test suite. You see the
problem with the previous statement: it's too idealistic ever to be fulfilled,
even without taking all the moving parts during application development and
deployment into account.
Missing a syscall in the seccomp profiles' allow list can have tremendously
negative impact on the application. It's not only that we can encounter crashes,
which are trivially detectable. It can also happen that they slightly change
logical paths, change the business logic, make parts of the application
unusable, slow down performance or even expose security vulnerabilities. We're
simply not able to see the whole impact of that, especially because blocked
syscalls via SCMP_ACT_ERRNO do not provide any additional audit
logging on the system.
Does that mean we're lost? Is it just not realistic to dream about a Kubernetes
where everyone uses the default seccomp profile? Should we
stop striving towards maximum security in Kubernetes and accept that it's not
meant to be secure by default?
Definitely not. Technology evolves over time and there are many folks
working behind the scenes of Kubernetes to indirectly deliver features to
address such problems. One of the mentioned features is the seccomp notifier,
which can be used to find suspicious syscalls in Kubernetes.
The seccomp notify feature consists of a set of changes introduced in Linux 5.9.
It makes the kernel capable of communicating seccomp related events to the user
space. That allows applications to act based on the syscalls and opens for a
wide range of possible use cases. We not only need the right kernel version,
but also at least runc v1.1.0 (or crun v0.19) to be able to make the notifier
work at all. The Kubernetes container runtime CRI-O gets support for
the seccomp notifier in v1.26.0. The new feature allows us to
identify possibly malicious syscalls in our application, and therefore makes it
possible to verify profiles for consistency and completeness. Let's give that a
try.
First of all, we need to run the latest main version of CRI-O, because v1.26.0
had not been released yet at the time of writing. You can do that by either
compiling it from the source code or by using the pre-built binary
bundle via the get-script. The seccomp notifier feature of CRI-O is
guarded by an annotation, which has to be explicitly allowed, for example by
using a configuration drop-in like this:
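A sketch of such a drop-in (the file name is arbitrary; runc is assumed as the default runtime):

cat /etc/crio/crio.conf.d/02-runtimes.conf
[crio.runtime]
default_runtime = "runc"

[crio.runtime.runtimes.runc]
allowed_annotations = ["io.kubernetes.cri-o.seccompNotifierAction"]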
If CRI-O is up and running, then it should indicate that the seccomp notifier is
available as well:
> sudo ./bin/crio --enable-metrics
…
INFO[…] Starting seccomp notifier watcher
INFO[…] Serving metrics on :9090 via HTTP
…
We also enable the metrics, because they provide additional telemetry data about
the notifier. Now we need a running Kubernetes cluster for demonstration
purposes. For this demo, we mainly stick to the
hack/local-up-cluster.sh approach to locally spawn a single node
Kubernetes cluster.
If everything is up and running, then we would have to define a seccomp profile
for testing purposes. But we do not have to create our own; we can just use the
RuntimeDefault profile which gets shipped with each container runtime. For
example the RuntimeDefault profile for CRI-O can be found in the
containers/common library.
Now we need a test container, which can be a simple nginx pod like
this:
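Here is a sketch of such a pod; the image tag is an assumption, while the annotation and restartPolicy are explained below:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    io.kubernetes.cri-o.seccompNotifierAction: "stop"
spec:
  restartPolicy: Never
  containers:
  - name: nginx
    image: nginx:1.23
    ports:
    - containerPort: 80
    securityContext:
      seccompProfile:
        type: RuntimeDefault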
Please note the annotation io.kubernetes.cri-o.seccompNotifierAction, which
enables the seccomp notifier for this workload. The value of the annotation can
be either stop, for stopping the workload, or anything else, for doing nothing
other than logging and emitting metrics. Because of the termination, we also use
restartPolicy: Never to avoid automatically recreating the container on
failure.
Let's run the pod and check if it works:
> kubectl apply -f nginx.yaml
> kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 3m39s 10.85.0.3 127.0.0.1 <none> <none>
We can also test if the web server itself works as intended:
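For example, by querying the pod IP from the output above (this should return the default nginx welcome page):

> curl -s 10.85.0.3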
While everything is now up and running, CRI-O also indicates that it has started
the seccomp notifier:
…
INFO[…] Injecting seccomp notifier into seccomp profile of container 662a3bb0fdc7dd1bf5a88a8aa8ef9eba6296b593146d988b4a9b85822422febb
…
If we now run a forbidden syscall inside the container, then we can
expect the workload to be terminated. Let's give that a try by running
chroot in the container's namespaces:
> kubectl exec -it nginx -- bash
root@nginx:/# chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
root@nginx:/# command terminated with exit code 137
The exec session got terminated, so it looks like the container is not running
any more:
> kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 seccomp killed 0 96s
Alright, the container got killed by seccomp, do we get any more information
about what was going on?
> kubectl describe pod nginx
Name: nginx
…
Containers:
nginx:
…
State: Terminated
Reason: seccomp killed
Message: Used forbidden syscalls: chroot (1x)
Exit Code: 137
Started: Mon, 14 Nov 2022 12:19:46 +0100
Finished: Mon, 14 Nov 2022 12:20:26 +0100
…
The seccomp notifier feature of CRI-O correctly set the termination reason and
message, including which forbidden syscall has been used how often (1x). How
often? Yes, the notifier gives the application up to 5 seconds after the last
seen syscall until it starts the termination. This means that it's possible to
catch multiple forbidden syscalls within one test, avoiding time-consuming
trial and error.
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl describe pod nginx | grep Message
Message: Used forbidden syscalls: chroot (2x), swapoff (2x)
The CRI-O metrics will also reflect that:
> curl -sf localhost:9090/metrics | grep seccomp_notifier
# HELP container_runtime_crio_containers_seccomp_notifier_count_total Amount of containers stopped because they used a forbidden syscalls by their name
# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1
How does it work in detail? CRI-O uses the chosen seccomp profile and injects
the action SCMP_ACT_NOTIFY instead of SCMP_ACT_ERRNO, SCMP_ACT_KILL,
SCMP_ACT_KILL_PROCESS or SCMP_ACT_KILL_THREAD. It also sets a local listener
path which will be used by the lower level OCI runtime (runc or crun) to create
the seccomp notifier socket. If the connection between the socket and CRI-O has
been established, then CRI-O will receive notifications for each syscall being
intercepted by seccomp. CRI-O stores the syscalls, allows a short timeout for
further ones to arrive, and then terminates the container if the chosen
seccompNotifierAction=stop. Unfortunately, the seccomp notifier is not able to
notify on the defaultAction, which means that it's required to have
a list of syscalls to test for custom profiles. CRI-O does also state that
limitation in the logs:
INFO[…] The seccomp profile default action SCMP_ACT_ERRNO cannot be overridden to SCMP_ACT_NOTIFY,
which means that syscalls using that default action can't be traced by the notifier
As a conclusion, the seccomp notifier implementation in CRI-O can be used to
verify if your applications behave correctly when using RuntimeDefault or any
other custom profile. Alerts can be created based on the metrics to create long
running test scenarios around that feature. Making seccomp understandable and
easier to use will increase adoption as well as help us to move towards a more
secure Kubernetes by default!
Thank you for reading this blog post. If you'd like to read more about the
seccomp notifier, check out the following resources:
When speaking about observability in the cloud native space, probably
everyone will mention OpenTelemetry (OTEL) at some point in the
conversation. That's great, because the community needs standards to rely on
for developing all cluster components in the same direction. OpenTelemetry
enables us to combine logs, metrics, traces and other contextual information
(called baggage) into a single resource. Cluster administrators or software
engineers can use this resource to get a viewport about what is going on in the
cluster over a defined period of time. But how can Kubernetes itself make use of
this technology stack?
Kubernetes consists of multiple components where some are independent and others
are stacked together. Looking at the architecture from a container runtime
perspective, there are, from the top to the bottom:
kube-apiserver: Validates and configures data for the API objects
kubelet: Agent running on each node
CRI runtime: Container Runtime Interface (CRI) compatible container runtime
like CRI-O or containerd
Linux kernel or Microsoft Windows: Underlying operating system
That means if we encounter a problem with running containers in Kubernetes, then
we start looking at one of those components. Finding the root cause for problems
is one of the most time consuming actions we face with the increased
architectural complexity from today's cluster setups. Even if we know the
component which seems to cause the issue, we still have to take the others into
account to maintain a mental timeline of events which are going on. How do we
achieve that? Well, most folks will probably stick to scraping logs, filtering
them and assembling them together over the components borders. We also have
metrics, right? Correct, but bringing metrics values in correlation with plain
logs makes it even harder to track what is going on. Some metrics are also not
made for debugging purposes. They have been defined based on the end user
perspective of the cluster for linking usable alerts and not for developers
debugging a cluster setup.
OpenTelemetry to the rescue: the project aims to combine signals such as
traces, metrics and logs together to maintain the
right viewport on the cluster state.
What is the current state of OpenTelemetry tracing in Kubernetes? From an API
server perspective, we have alpha support for tracing since Kubernetes v1.22,
which will graduate to beta in one of the upcoming releases. Unfortunately the
beta graduation has missed the v1.26 Kubernetes release. The design proposal can
be found in the API Server Tracing Kubernetes Enhancement Proposal
(KEP) which provides more information about it.
The kubelet tracing part is tracked in another KEP, which was
implemented in an alpha state in Kubernetes v1.25. A beta graduation is not
planned at the time of writing, but more may come in the v1.27 release cycle.
There are other side-efforts going on beside both KEPs, for example klog is
considering OTEL support, which would boost the observability by
linking log messages to existing traces. Within SIG Instrumentation and SIG Node,
we're also discussing how to link the
kubelet traces together, because right now they're focused on the
gRPC calls between the kubelet and the CRI container runtime.
CRI-O has featured OpenTelemetry tracing support since v1.23.0 and is
continuously working on improving it, for example by attaching the logs to the
traces or extending the spans to logical parts of the
application. This helps users of the traces to gain the same
information as they would from parsing the logs, but with enhanced capabilities
for scoping and filtering on other OTEL signals. The CRI-O maintainers are also working on a
container monitoring replacement for conmon, which is called
conmon-rs and is purely written in Rust. One benefit of
having a Rust implementation is to be able to add features like OpenTelemetry
support, because the crates (libraries) for those already exist. This allows a
tight integration with CRI-O and lets consumers see the lowest-level tracing
data from their containers.
The containerd folks added tracing support in v1.6.0, where it is
available by using a plugin. Lower level OCI runtimes like
runc or crun feature no support for OTEL at all, and there does not
seem to be a plan for that. We always have to consider that there is a
performance overhead when collecting the traces as well as exporting them to a
data sink. I still think it would be worth evaluating how extended
telemetry collection could look in OCI runtimes. Let's see if the Rust OCI
runtime youki considers something like that in the future.
I'll show you how to give it a try. For my demo I'll stick to a stack with a single local node
that has runc, conmon-rs, CRI-O, and a kubelet. To enable tracing in the kubelet, I need to
apply the following KubeletConfiguration:
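A minimal sketch of such a configuration, assuming the KubeletTracing feature gate and the kubelet's tracing stanza (the collector endpoint defaults to localhost:4317 if unset):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  samplingRatePerMillion: 1000000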
A samplingRatePerMillion value equal to one million will internally translate to
sampling everything. A similar configuration has to be applied to CRI-O; I can
either start the crio binary with --enable-tracing and
--tracing-sampling-rate-per-million 1000000, or use a drop-in configuration
like this:
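A sketch of such a drop-in (the file name is arbitrary; the option names assume CRI-O's [crio.tracing] configuration section):
cat /etc/crio/crio.conf.d/99-tracing.conf
[crio.tracing]
enable_tracing = true
tracing_sampling_rate_per_million = 1000000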
To configure CRI-O to use conmon-rs, you require at least the latest CRI-O
v1.25.x and conmon-rs v0.4.0. Then a configuration drop-in like this can be used
to make CRI-O use conmon-rs:
cat /etc/crio/crio.conf.d/99-runtimes.conf
[crio.runtime]
default_runtime = "runc"
[crio.runtime.runtimes.runc]
runtime_type = "pod"
monitor_path = "/path/to/conmonrs" # or will be looked up in $PATH
That's it, the default configuration will point to an OpenTelemetry
collector gRPC endpoint of localhost:4317, which has to be up and
running as well. There are multiple ways to run an OTLP collector as described in the
docs, but it's also possible to kubectl proxy into an existing
instance running within Kubernetes.
If everything is set up, then the collector should log that there are incoming
traces:
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope go.opentelemetry.io/otel/sdk/tracer
Span #0
Trace ID : 71896e69f7d337730dfedb6356e74f01
Parent ID : a2a7714534c017e6
ID : 1d27dbaf38b9da8b
Name : github.com/cri-o/cri-o/server.(*Server).filterSandboxList
Kind : SPAN_KIND_INTERNAL
Start time : 2022-11-15 09:50:20.060325562 +0000 UTC
End time : 2022-11-15 09:50:20.060326291 +0000 UTC
Status code : STATUS_CODE_UNSET
Status message :
Span #1
Trace ID : 71896e69f7d337730dfedb6356e74f01
Parent ID : a837a005d4389579
ID : a2a7714534c017e6
Name : github.com/cri-o/cri-o/server.(*Server).ListPodSandbox
Kind : SPAN_KIND_INTERNAL
Start time : 2022-11-15 09:50:20.060321973 +0000 UTC
End time : 2022-11-15 09:50:20.060330602 +0000 UTC
Status code : STATUS_CODE_UNSET
Status message :
Span #2
Trace ID : fae6742709d51a9b6606b6cb9f381b96
Parent ID : 3755d12b32610516
ID : 0492afd26519b4b0
Name : github.com/cri-o/cri-o/server.(*Server).filterContainerList
Kind : SPAN_KIND_INTERNAL
Start time : 2022-11-15 09:50:20.0607746 +0000 UTC
End time : 2022-11-15 09:50:20.060795505 +0000 UTC
Status code : STATUS_CODE_UNSET
Status message :
Events:
SpanEvent #0
-> Name: log
-> Timestamp: 2022-11-15 09:50:20.060778668 +0000 UTC
-> DroppedAttributesCount: 0
-> Attributes::
-> id: Str(adf791e5-2eb8-4425-b092-f217923fef93)
-> log.message: Str(No filters were applied, returning full container list)
-> log.severity: Str(DEBUG)
-> name: Str(/runtime.v1.RuntimeService/ListContainers)
I can see that the spans have a trace ID and typically have a parent attached.
Events such as logs are part of the output as well. In the above case, the kubelet is
periodically triggering a ListPodSandbox RPC to CRI-O caused by the Pod
Lifecycle Event Generator (PLEG). Displaying those traces can be done via,
for example, Jaeger. When running the tracing stack locally, a Jaeger
instance should be exposed on http://localhost:16686 by default.
The ListPodSandbox requests are directly visible within the Jaeger UI.
That's not too exciting, so I'll run a workload directly via kubectl:
kubectl run -it --rm --restart=Never --image=alpine alpine -- echo hi
hi
pod "alpine" deleted
Looking now at Jaeger, we can see that we have traces for conmonrs, crio as
well as the kubelet for the RunPodSandbox and CreateContainer CRI RPCs.
The kubelet and CRI-O spans are connected to each other to make investigation
easier. If we now take a closer look at the spans, we can see that CRI-O's
logs are correctly associated with the corresponding functionality. For example,
we can extract the container user from the traces.
The lower level spans of conmon-rs are also part of this trace. For example
conmon-rs maintains an internal read_loop for handling IO between the
container and the end user. The logs for reading and writing bytes are part of
the span. The same applies to the wait_for_exit_code span, which tells us that
the container exited successfully with code 0.
Having all that information at hand, alongside the filtering capabilities
of Jaeger, makes the whole stack a great solution for debugging container issues!
Mentioning the "whole stack" also shows the biggest downside of the overall
approach: Compared to parsing logs it adds a noticeable overhead on top of the
cluster setup. Users have to maintain a sink like Elasticsearch to
persist the data, expose the Jaeger UI and possibly take the performance
drawback into account. Anyway, it's still one of the best ways to increase the
observability aspect of Kubernetes.
Thank you for reading this blog post, I'm pretty sure we're looking into a
bright future for OpenTelemetry support in Kubernetes to make troubleshooting
simpler.
Authors: Adolfo García Veytia (Chainguard), Bob Killen (Google)
Starting with Kubernetes 1.25, our container image registry has changed from k8s.gcr.io to registry.k8s.io. This new registry spreads the load across multiple Cloud Providers & Regions, functioning as a sort of content delivery network (CDN) for Kubernetes container images. This change reduces the project’s reliance on a single entity and provides a faster download experience for a large number of users.
TL;DR: What you need to know about this change
Container images for Kubernetes releases from 1.25 onward are no longer published to k8s.gcr.io, only to registry.k8s.io.
In the upcoming December patch releases, the new registry domain default will be backported to all branches still in support (1.22, 1.23, 1.24).
If you run in a restricted environment and apply strict domain/IP address access policies limited to k8s.gcr.io, the image pulls will not function after the migration to this new registry. For these users, the recommended method is to mirror the release images to a private registry.
If you’d like to know more about why we made this change, or some potential issues you might run into, keep reading.
Why has Kubernetes changed to a different image registry?
k8s.gcr.io is hosted on a custom Google Container Registry (GCR) domain that was set up solely for the Kubernetes project. This has worked well since the inception of the project, and we thank Google for providing these resources, but today there are other cloud providers and vendors that would like to host images to provide a better experience for the people on their platforms. In addition to Google’s renewed commitment to donate $3 million to support the project's infrastructure, Amazon announced a matching donation during their KubeCon NA 2022 keynote in Detroit. This will provide a better experience for users (closer servers = faster downloads) and will reduce the egress bandwidth and costs from GCR at the same time. registry.k8s.io will spread the load between Google and Amazon, with other providers to follow in the future.
Why isn’t there a stable list of domains/IPs? Why can’t I restrict image pulls?
registry.k8s.io is a secure blob redirector that connects clients to the closest cloud provider. The nature of this change means that a client pulling an image could be redirected to any one of a large number of backends. We expect the set of backends to keep changing and will only increase as more and more cloud providers and vendors come on board to help mirror the release images.
Restrictive control mechanisms like man-in-the-middle proxies or network policies that restrict access to a specific list of IPs/domains will break with this change. For these scenarios, we encourage you to mirror the release images to a local registry that you have strict control over.
What kind of errors will I see? How will I know if I’m still using the old address?
Errors may depend on what kind of container runtime you are using, and what endpoint you are routed to, but it should present as a container failing to be created with the warning FailedCreatePodSandBox.
Below is an example error message showing a proxied deployment failing to pull due to an unknown certificate:
FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: Head “https://us-west1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8”: x509: certificate signed by unknown authority
I’m impacted by this change, how do I revert to the old registry address?
If using the new registry domain name is not an option, you can revert to the old domain name for cluster versions less than 1.25. Keep in mind that, eventually, you will have to switch to the new registry, as new image tags will no longer be pushed to GCR.
Reverting the registry name in kubeadm
The registry used by kubeadm to pull its images can be controlled by two methods:
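Setting the --image-repository flag, for example: kubeadm init --image-repository=k8s.gcr.io
Setting the imageRepository field in the kubeadm ClusterConfiguration, sketched below (the v1beta3 API version is assumed; adjust it to match your kubeadm release):
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
imageRepository: "k8s.gcr.io"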
Change is hard, and evolving our image-serving platform is needed to ensure a sustainable future for the project. We strive to make things better for everyone using Kubernetes. Many contributors from all corners of our community have been working long and hard to ensure we are making the best decisions possible, executing plans, and doing our best to communicate those plans.
Thanks to Aaron Crickenberger, Arnaud Meukam, Benjamin Elder, Caleb Woodbine, Davanum Srinivas, Mahamed Ali, and Tim Hockin from SIG K8s Infra, Brian McQueen, and Sergey Kanzhelev from SIG Node, Lubomir Ivanov from SIG Cluster Lifecycle, Adolfo García Veytia, Jeremy Rickard, Sascha Grunert, and Stephen Augustus from SIG Release, Bob Killen and Kaslin Fields from SIG Contribex, Tim Allclair from the Security Response Committee. Also a big thank you to our friends acting as liaisons with our cloud provider partners: Jay Pipes from Amazon and Jon Johnson Jr. from Google.
Change is an integral part of the Kubernetes life-cycle: as Kubernetes grows and matures, features may be deprecated, removed, or replaced with improvements for the health of the project. For Kubernetes v1.26 there are several planned deprecations and removals; this article identifies and describes some of them, based on the information available at this mid-cycle point in the v1.26 release process, which is still ongoing and may introduce additional changes.
The Kubernetes API Removal and Deprecation process
The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API is one that has been marked for removal in a future Kubernetes release; it will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.
Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.
Beta or pre-release API versions must be supported for 3 releases after deprecation.
Alpha or experimental API versions may be removed in any release without prior deprecation notice.
Whether an API is removed as a result of a feature graduating from beta to stable or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the documentation.
A note about the removal of the CRI v1alpha2 API and containerd 1.5 support
Following the adoption of the Container Runtime Interface (CRI) and the removal of dockershim in v1.24, the CRI is the supported and documented way through which Kubernetes interacts with different container runtimes. Each kubelet negotiates which version of CRI to use with the container runtime on that node.
The Kubernetes project recommends using CRI version v1; in Kubernetes v1.25 the kubelet can also negotiate the use of CRI v1alpha2 (which was deprecated at the same time as support for the stable v1 interface was added).
Kubernetes v1.26 will not support CRI v1alpha2. That removal will result in the kubelet not registering the node if the container runtime doesn't support CRI v1. This means that containerd minor version 1.5 and older will not be supported in Kubernetes 1.26; if you use containerd, you will need to upgrade to containerd version 1.6.0 or later before you upgrade that node to Kubernetes v1.26. Other container runtimes that only support v1alpha2 are equally affected: if that affects you, you should contact the container runtime vendor for advice or check their website for additional instructions on how to move forward.
If you want to benefit from v1.26 features and still use an older container runtime, you can run an older kubelet. The supported skew for the kubelet allows you to run a v1.25 kubelet, which is still compatible with v1alpha2 CRI support, even if you upgrade the control plane to the 1.26 minor release of Kubernetes.
As well as container runtimes themselves, there are tools like stargz-snapshotter that act as a proxy between the kubelet and the container runtime; those might also be affected.
Deprecations and removals in Kubernetes v1.26
In addition to the above, Kubernetes v1.26 is targeted to include several additional removals and deprecations.
Removal of the v1beta1 flow control API group
The flowcontrol.apiserver.k8s.io/v1beta1 API version of FlowSchema and PriorityLevelConfiguration will no longer be served in v1.26. Users should migrate manifests and API clients to use the flowcontrol.apiserver.k8s.io/v1beta2 API version, available since v1.23.
Removal of the v2beta2 HorizontalPodAutoscaler API
The autoscaling/v2beta2 API version of HorizontalPodAutoscaler will no longer be served in v1.26. Users should migrate manifests and API clients to use the autoscaling/v2 API version, available since v1.23.
Removal of in-tree credential management code
In this upcoming release, legacy vendor-specific authentication code that is part of Kubernetes
will be removed from both
client-go and kubectl.
The existing mechanism supports authentication for two specific cloud providers:
Azure and Google Cloud.
In its place, Kubernetes already offers a vendor-neutral
authentication plugin mechanism -
you can switch over right now, before the v1.26 release happens.
If you're affected, you can find additional guidance on how to proceed for
Azure and for
Google Cloud.
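As a sketch of what that mechanism looks like in a kubeconfig, a user entry delegates credential acquisition to an external command via the exec plugin API (kubelogin is Azure's plugin; the arguments depend on your provider and are deliberately omitted here):
users:
- name: example-user
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubelogin
      # provider-specific arguments go here; see your provider's documentation
      interactiveMode: IfAvailable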
Removal of kube-proxy userspace modes
The userspace proxy mode, deprecated for over a year, is no longer supported on either Linux or Windows and will be removed in this release. Users should use iptables or ipvs on Linux, or kernelspace on Windows: using --mode userspace will now fail.
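If you need to pick a supported mode explicitly, a minimal kube-proxy configuration file could look like this (a sketch; in the configuration API the equivalent field is called mode):
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"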
Removal of in-tree OpenStack cloud provider
Kubernetes is switching from in-tree code for storage integrations, in favor of the Container Storage Interface (CSI).
As part of this, Kubernetes v1.26 will remove the deprecated in-tree storage integration for OpenStack
(the cinder volume type). You should migrate to the external cloud provider and CSI driver from
https://github.com/kubernetes/cloud-provider-openstack instead.
For more information, visit Cinder in-tree to CSI driver migration.
Removal of the GlusterFS in-tree driver
The in-tree GlusterFS driver was deprecated in v1.25, and will be removed from Kubernetes v1.26.
Deprecation of non-inclusive kubectl flag
As part of the implementation effort of the Inclusive Naming Initiative,
the --prune-whitelist flag will be deprecated, and replaced with --prune-allowlist.
Users that use this flag are strongly advised to make the necessary changes prior to the final removal of the flag, in a future release.
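For example, once the replacement flag is available, a typical pruning invocation could look like this (a sketch; note that --prune also requires a label selector or --all):
kubectl apply -f manifests/ -l app=example --prune --prune-allowlist=core/v1/ConfigMap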
Removal of dynamic kubelet configuration
Dynamic kubelet configuration allowed new kubelet configurations to be rolled out via the Kubernetes API, even in a live cluster.
A cluster operator could reconfigure the kubelet on a Node by specifying a ConfigMap
that contained the configuration data that the kubelet should use.
Dynamic kubelet configuration was removed from the kubelet in v1.24, and will be
removed from the API server in the v1.26 release.
Deprecations for kube-apiserver command line arguments
The --master-service-namespace command line argument to the kube-apiserver doesn't have
any effect, and was already informally deprecated.
That command line argument will be formally marked as deprecated in v1.26, preparing for its
removal in a future release.
The Kubernetes project does not expect any impact from this deprecation and removal.
Deprecations for kubectl run command line arguments
Several unused option arguments for the kubectl run subcommand will be marked as deprecated, including:
--cascade
--filename
--force
--grace-period
--kustomize
--recursive
--timeout
--wait
These arguments are already ignored so no impact is expected: the explicit deprecation sets a warning message and prepares the removal of the arguments in a future release.
Removal of legacy command line arguments relating to logging
This blog post was inspired by a previous Kubernetes blog post about
Advanced Server Side Apply.
The author of said blog post listed multiple benefits for applications and
controllers when switching to server-side apply (from now on abbreviated with
SSA). Especially the chapter about
CI/CD systems
motivated me to respond and write down my thoughts and experiences.
These thoughts and experiences are the results of me working on Kluctl
for the past 2 years. I describe Kluctl as "The missing glue to put together
large Kubernetes deployments, composed of multiple smaller parts
(Helm/Kustomize/...) in a manageable and unified way."
To get a basic understanding of Kluctl, I suggest visiting the kluctl.io
website and read through the documentation and tutorials, for example the
microservices demo tutorial.
As an alternative, you can watch Hands-on Introduction to kluctl
from the Rawkode Academy YouTube channel which shows a hands-on demo session.
One of the main philosophies that Kluctl follows is "live and let live",
meaning that it will try its best to work in conjunction with any other tool or
controller running outside or inside your clusters. Kluctl will not overwrite
any fields that it lost ownership of, unless you explicitly tell it to do so.
Achieving this would not have been possible (or at least several magnitudes
harder) without the use of SSA. Server-side apply allows Kluctl
to detect when ownership for a field got lost, for example when another controller
or operator updates that field to another value. Kluctl can then decide on a
field-by-field basis whether force-applying is required, and retry based on
these decisions.
The days before SSA
The first versions of Kluctl were based on shelling out to kubectl and thus
implicitly relied on client-side apply. At that time, SSA was
still alpha and quite buggy. And to be honest, I didn't even know it was a
thing at that time.
The way client-side apply worked had some serious drawbacks. The most obvious one
(it was guaranteed that you'd stumble on this by yourself if enough time passed)
is that it relied on an annotation (kubectl.kubernetes.io/last-applied-configuration)
being added to the object, bringing in all the limitations and issues with huge
annotation values. A good example of such issues are
CRDs being so large,
that they don't fit into the annotation's value anymore.
Another drawback can be seen just by looking at the name (client-side apply).
Being client side means that each client has to provide the apply-logic on
its own, which at that time was only properly implemented inside kubectl,
making it hard to be replicated inside controllers.
This added kubectl as a dependency (either as an executable or in the form of
Go packages) to all controllers that wanted to leverage the apply-logic.
However, even if one managed to get client-side apply running from inside a
controller, you ended up with a solution that gave no control over how it
worked internally. As an example, there was no way to individually decide which
fields to overwrite in case of external changes and which ones to let go.
Discovering server-side apply
I was never happy with the solution described above and then somehow stumbled
across server-side apply,
which was still in beta at that time. Experimenting with it via
kubectl apply --server-side revealed immediately that the true power of
SSA cannot be easily leveraged by shelling out to kubectl.
The way SSA is implemented in kubectl does not allow enough
control over conflict resolution as it can only switch between
"not force-applying anything and erroring out" and "force-applying everything
without showing any mercy!".
The API documentation however made it clear that SSA is able to
control conflict resolution at the field level, simply by choosing which fields
to include and which fields to omit from the supplied object.
Moving away from kubectl
This meant that Kluctl had to move away from shelling out to kubectl first. Only
after that was done, I would have been able to properly implement SSA
with its powerful conflict resolution.
To achieve this, I first implemented access to the target clusters via a
Kubernetes client library. This had the nice side effect of dramatically
speeding up Kluctl as well. It also improved the security and usability of
Kluctl by ensuring that a running Kluctl command could not be messed around
with by externally modifying the kubeconfig while it was running.
Implementing SSA
After switching to a Kubernetes client library, leveraging SSA
felt easy. Kluctl now has to send each manifest to the API server as part of a
PATCH request, which signals
that Kluctl wants to perform an SSA operation. The API server then
responds with an OK response (HTTP status code 200), or with a Conflict response
(HTTP status 409).
In case of a Conflict response, the body of that response includes machine-readable
details about the conflicts. Kluctl can then use these details to figure out
which fields are in conflict and which actors (field managers) have taken
ownership of the conflicted fields.
Then, for each field, Kluctl will decide if the conflict should be ignored or
if it should be force-applied. If any field needs to be force-applied, Kluctl
will retry the apply operation with the ignored fields omitted and the force
flag being set on the API call.
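To illustrate, the machine-readable details of a Conflict response look roughly like the following Status object (a hand-written sketch, not verbatim API server output; the field and manager names are invented):
apiVersion: v1
kind: Status
status: Failure
reason: Conflict
code: 409
message: 'Apply failed with 1 conflict: conflict with "kubectl" using apps/v1: .spec.replicas'
details:
  causes:
  - reason: FieldManagerConflict
    field: .spec.replicas
    message: conflict with "kubectl"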
In case a conflict is ignored, Kluctl will issue a warning to the user so that
the user can react properly (or ignore it forever...).
That's basically it. That is all that is required to leverage SSA.
Big thanks and thumbs-up to the Kubernetes developers who made this possible!
Conflict Resolution
Kluctl has a few simple rules to figure out if a conflict should be ignored
or force-applied.
It first checks the field's actor (the field manager) against a list of known
field manager strings from tools that are frequently used to perform manual modifications. These
are for example kubectl and k9s. Any modifications performed with these tools
are considered "temporary" and will be overwritten by Kluctl.
If you're using Kluctl along with kubectl and you don't want the changes from
kubectl to be overwritten (for example, when using it in a script), then you can specify
--field-manager=<manager-name> on the command line to kubectl, and Kluctl
will not apply its special heuristic.
If the field manager is not known by Kluctl, it will check if force-applying is
requested for that field. Force-applying can be requested in different ways:
By passing --force-apply to Kluctl. This will cause ALL fields to be force-applied on conflicts.
By adding the kluctl.io/force-apply=true annotation to the object in question. This will cause all fields of that object to be force-applied on conflicts.
By adding the kluctl.io/force-apply-field=my.json.path annotation to the object in question. This causes only fields matching the JSON path to be force-applied on conflicts.
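For example, based on the annotations described above, an object could opt a single field into force-applying like this (the Deployment and the field path are purely illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
  annotations:
    kluctl.io/force-apply-field: "spec.replicas"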
Marking a field to be force-applied is required whenever some other actor is
known to erroneously claim fields (the ECK operator does this to the nodeSets
field, for example); this way, you can ensure that Kluctl always overwrites these
fields to the original or a new value.
In the future, Kluctl will allow even more control over conflict resolution.
For example, the CLI will allow controlling force-applying at the field level.
DevOps vs Controllers
So how does SSA in Kluctl lead to "live and let live"?
It allows the co-existence of classical pipelines (e.g. Github Actions or
Gitlab CI), controllers (e.g. the HPA controller or GitOps style controllers)
and even admins running deployments from their local machines.
Wherever you are on your infrastructure automation journey, Kluctl has a place
for you. From running deployments using a script on your PC, all the way to
fully automated CI/CD with the pipelines themselves defined in code, Kluctl
aims to complement the workflow that's right for you.
And even after fully automating everything, you can intervene with your admin
permissions if required and run a kubectl command that will modify a field
and prevent Kluctl from overwriting it. You'd just have to switch to a
field-manager (e.g. "admin-override") that is not overwritten by Kluctl.
A few takeaways
Server-side apply is a great feature and essential for the future of
controllers and tools in Kubernetes. The number of controllers involved
will only grow, and proper modes of working together are a must.
I believe that CI/CD-related controllers and tools should leverage
SSA to perform proper conflict resolution. I also believe that
other controllers (e.g. Flux and ArgoCD) would benefit from the same kind
of conflict resolution control on field-level.
It might even be a good idea to come together and work on a standardized
set of annotations to control conflict resolution for CI/CD-related tooling.
On the other side, non CI/CD-related controllers should ensure that they don't
cause unnecessary conflicts when modifying objects. As per
the server-side apply documentation,
it is strongly recommended for controllers to always perform force-applying. When
following this recommendation, controllers should really make sure that only
fields related to the controller are included in the applied object.
Otherwise, unnecessary conflicts are guaranteed.
In many cases, controllers are meant to only modify the status subresource
of the objects they manage. In this case, controllers should only patch the
status subresource and not touch the actual object. If this is followed,
conflicts cannot occur.
If you are a developer of such a controller and unsure about your controller
adhering to the above, simply try to retrieve an object managed by your
controller and look at the managedFields (you'll need to pass
--show-managed-fields -oyaml to kubectl get) to see if some field got
claimed unexpectedly.
Server-side apply (SSA) has now
been GA for a few releases, and I
have found myself in a number of conversations, recommending that people / teams
in various situations use it. So I’d like to write down some of those reasons.
Obvious (and not-so-obvious) benefits of SSA
A list of improvements / niceties you get from switching from various things to
Server-side apply!
Versus client-side-apply (that is, plain kubectl apply):
The system gives you conflicts when you accidentally fight with another
actor over the value of a field!
When combined with --dry-run, there’s no chance of accidentally running a
client-side dry run instead of a server side dry run.
Versus hand-rolling patches:
The SSA patch format is extremely natural to write, with no weird syntax.
It’s just a regular object, but you can (and should) omit any field you
don’t care about.
The old patch format (“strategic merge patch”) was ad-hoc and still has some
bugs; JSON-patch and JSON merge-patch fail to handle some cases that are
common in the Kubernetes API, namely lists with items that should be
recursively merged based on a “name” or other identifying field.
You can use SSA to explicitly delete fields you don’t “own” by setting them
to null, which makes it a feature-complete replacement for all of the old
patch formats.
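As a sketch of that last point, applying a manifest like the following server-side requests deletion of the replicas field (the Deployment is illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: null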
Versus shelling out to kubectl:
You can use the apply API call from any language without shelling out to
kubectl!
(This one is more complicated and you can skip it if you've never written a
controller!)
To use GET-modify-PUT correctly, you have to handle and retry a write
failure in the case that someone else has modified the object in any way
between your GET and PUT. This is an “optimistic concurrency failure” when
it happens.
SSA offloads this task to the server– you only have to retry if there’s a
conflict, and the conflicts you can get are all meaningful, like when you’re
actually trying to take a field away from another actor in the system.
To put it another way, if 10 actors do a GET-modify-PUT cycle at the same
time, 9 will get an optimistic concurrency failure and have to retry, then
8, etc, for up to 50 total GET-PUT attempts in the worst case (that’s .5N^2
GET and PUT calls for N actors making simultaneous changes). If the actors
are using SSA instead, and the changes don’t actually conflict over specific
fields, then all the changes can go in in any order. Additionally, SSA
changes can often be done without a GET call at all. That’s only N apply
requests for N actors, which is a drastic improvement!
How can I use SSA?
Users
Use kubectl apply --server-side! Soon we (SIG API Machinery) hope to make this
the default and remove the “client side” apply completely!
Controller authors
There are two main categories here, but for both of them, you should probably
force conflicts when using SSA. This is because your controller probably
doesn’t know what to do when some other entity in the system has a different
desire than your controller about a particular field. (See the CI/CD
section, though!)
Controllers that use either a GET-modify-PUT sequence or a PATCH
This kind of controller GETs an object (possibly from a
watch),
modifies it, and then PUTs it back to write its changes. Sometimes it constructs
a custom PATCH, but the semantics are the same. Most existing controllers
(especially those in-tree) work like this.
If your controller is perfect, great! You don’t need to change it. But if you do
want to change it, you can take advantage of the new client library’s extract
workflow– that is, get the existing object, extract your existing desires,
make modifications, and re-apply. For many controllers that were computing
the smallest API changes possible, this will be a minor update to the existing
implementation.
This workflow avoids the failure mode of accidentally trying to own every field
in the object, which is what happens if you just GET the object, make changes,
and then apply. (Note that the server will notice you did this and reject
your change!)
Reconstructive controllers
This kind of controller wasn't really possible prior to SSA. The idea here is to
(whenever something changes etc) reconstruct from scratch the fields of the
object as the controller wishes them to be, and then apply the change to the
server, letting it figure out the result. I now recommend that new controllers
start out this way–it's less fiddly to say what you want an object to look like
than it is to say how you want it to change.
The client library supports this method of operation by default.
The only downside is that you may end up sending unneeded apply requests to
the API server, even if actually the object already matches your controller’s
desires. This doesn't matter if it happens once in a while, but for extremely
high-throughput controllers, it might cause a performance problem for the
cluster–specifically, the API server. No-op writes are not written to storage
(etcd) or broadcast to any watchers, so it’s not really that big of a deal. If
you’re worried about this anyway, today you could use the method explained in
the previous section, or you could still do it this way for now, and wait for an
additional client-side mechanism to suppress zero-change applies.
To get around this downside, why not GET the object and only send your apply
if the object needs it? Surprisingly, it doesn't help much – a no-op apply is
not very much more work for the API server than an extra GET; and an apply
that changes things is cheaper than that same apply with a preceding GET.
Worse, since it is a distributed system, something could change between your GET
and apply, invalidating your computation. Instead, you can use this
optimization on an object retrieved from a cache–then it legitimately will
reduce load on the system (at the cost of a delay when a change is needed and
the cache is a bit behind).
CI/CD systems
Continuous integration (CI) and/or continuous deployment (CD) systems are a
special kind of controller which is doing something like reading manifests from
source control (such as a Git repo) and automatically pushing them into the
cluster. Perhaps the CI / CD process first generates manifests from a template,
then runs some tests, and then deploys a change. Typically, users are the
entities pushing changes into source control, although that’s not necessarily
always the case.
Some systems like this continuously reconcile with the cluster, others may only
operate when a change is pushed to the source control system. The following
considerations are important for both, but more so for the continuously
reconciling kind.
CI/CD systems are literally controllers, but for the purpose of apply, they
are more like users, and unlike other controllers, they need to pay attention to
conflicts. Reasoning:
Abstractly, CI/CD systems can change anything, which means they could conflict
with any controller out there. The recommendation that controllers force
conflicts is assuming that controllers change a limited number of things and
you can be reasonably sure that they won’t fight with other controllers about
those things; that’s clearly not the case for CI/CD controllers.
Concrete example: imagine the CI/CD system wants .spec.replicas for some
Deployment to be 3, because that is the value that is checked into source
code; however there is also a HorizontalPodAutoscaler (HPA) that targets the
same deployment. The HPA computes a target scale and decides that there should
be 10 replicas. Which should win? I just said that most controllers–including
the HPA–should ignore conflicts. The HPA has no idea if it has been enabled
incorrectly, and the HPA has no convenient way of informing users of errors.
The other common cause of a CI/CD system getting a conflict is probably when
it is trying to overwrite a hot-fix (hand-rolled patch) placed there by a
system admin / SRE / dev-on-call. You almost certainly don’t want to override
that automatically.
Of course, sometimes SRE makes an accidental change, or a dev makes an
unauthorized change – those you do want to notice and overwrite; however, the
CI/CD system can’t tell the difference between these last two cases.
Hopefully this convinces you that CI/CD systems need error paths–a way to
back-propagate these conflict errors to humans; in fact, they should have this
already, certainly continuous integration systems need some way to report that
tests are failing. But maybe I can also say something about how humans can
deal with errors:
Reject the hotfix: the (human) administrator of the CI/CD system observes the
error, and manually force-applies the manifest in question. Then the CI/CD
system will be able to apply the manifest successfully and become a co-owner.
Optional: then the administrator applies a blank manifest (just the object
type / namespace / name) to relinquish any fields they became a manager for.
If this step is omitted, there's some chance the administrator will end up
owning fields and causing an unwanted future conflict.
Note: why an administrator? I'm assuming that developers which ordinarily
push to the CI/CD system and / or its source control system may not have
permissions to push directly to the cluster.
Accept the hotfix: the author of the change in question sees the conflict, and
edits their change to accept the value running in production.
Accept then reject: as in the accept option, but after that manifest is
applied, and the CI/CD queue owns everything again (so no conflicts), re-apply
the original manifest.
I can also imagine the CI/CD system permitting you to mark a manifest as
“force conflicts” somehow– if there’s demand for this we could consider making
a more standardized way to do this. A rigorous version of this which lets you
declare exactly which conflicts you intend to force would require support from
the API server; in lieu of that, you can make a second manifest with only that
subset of fields.
Future work: we could imagine an especially advanced CI/CD system that could
parse metadata.managedFields data to see who or what they are conflicting
with, over what fields, and decide whether or not to ignore the conflict. In
fact, this information is also presented in any conflict errors, though
perhaps not in an easily machine-parseable format. We (SIG API Machinery)
mostly didn't expect that people would want to take this approach — so we
would love to know if in fact people want/need the features implied by this
approach, such as the ability, when applying, to request overriding
certain conflicts but not others.
If this sounds like an approach you'd want to take for your own controller,
come talk to SIG API Machinery!
Craig Ingram has graciously attempted over the years to keep track of the
status of the findings reported in the last audit in this issue:
kubernetes/kubernetes#81146.
This blog post will attempt to dive deeper into this, address any gaps
in tracking, and become a point-in-time summary of the state of the
findings reported in 2019.
This article should also help readers gain confidence, through transparent
communication, in the work done by the community to address these findings, and
bubble up any findings that need help from community contributors.
Current State
The status of each issue / finding here is represented in a best effort manner.
The authors do not claim to be 100% accurate on the status, and welcome any
corrections or feedback, via comments directly on the relevant issue, if the
current state is not reflected accurately.
Apart from fixes to the specific issues, the 2019 third party security audit
also motivated security-focused enhancements in the next few releases of
Kubernetes. One such example is
Kubernetes Enhancement Proposal (KEP) 1933, Defend Against Logging Secrets via Static Analysis, to prevent exposing
secrets to logs, with Patrick Rhomberg driving the
implementation. As a result of this KEP,
go-flow-levee, a taint propagation
analysis tool configured to detect logging of secrets, is executed in a
script
as a Prow presubmit job. This KEP was introduced in v1.20.0 as an alpha
feature, then graduated to beta in v1.21.0, and graduated to stable in
v1.23.0. As stable, the analysis runs as a blocking presubmit test. This
KEP also helped resolve several issues from the 2019 third party security audit.
Many of the 37 findings identified were fixed by work from
our community members over the last 3 years. However, we still have some work
left to do. Here's a breakdown of remaining work with rough estimates on
time commitment, complexity and benefits to the ecosystem on fixing
these pending issues.
Note: Anything requiring a KEP (Kubernetes Enhancement Proposal) is considered
high time commitment and high complexity. Benefits to Ecosystem are
roughly equivalent to the risk of keeping the finding unfixed, which is
determined by Severity Level + Likelihood of a successful vulnerability
exploit. These estimates and values in the table below are the authors'
personal opinion. An individual or end users' threat model may rate the
benefits to fix a particular issue higher or lower.
Title | Issue | Time Commitment | Complexity | Benefit to Ecosystem
Kubernetes does not facilitate certificate revocation
To get started on fixing any of these findings that need help, please
consider getting involved in Kubernetes SIG
Security
by joining our bi-weekly meetings or hanging out with us on our Slack
Channel.
Authors: Abdullah Gharaibeh (Google), Aldo Culquicondor (Google)
Whether on-premises or in the cloud, clusters face real constraints for resource usage, quota, and cost management reasons. Regardless of autoscaling capabilities, clusters have finite capacity. As a result, users want an easy way to fairly and
efficiently share resources.
In this article, we introduce Kueue,
an open source job queueing controller designed to manage batch jobs as a single unit.
Kueue leaves pod-level orchestration to existing stable components of Kubernetes.
Kueue natively supports the Kubernetes Job
API and offers hooks for integrating other custom-built APIs for batch jobs.
Why Kueue?
Job queueing is a key feature to run batch workloads at scale in both on-premises and cloud environments. The main goal
of job queueing is to manage access to a limited pool of resources shared by multiple tenants. Job queueing decides which
jobs should wait, which can start immediately, and what resources they can use.
Some of the most desired job queueing requirements include:
Quota and budgeting to control who can use what and up to what limit. This is not only needed in clusters with static resources like on-premises,
but it is also needed in cloud environments to control spend or usage of scarce resources.
Fair sharing of resources between tenants. To maximize the usage of available resources, any unused quota assigned to inactive tenants should be
allowed to be shared fairly between active tenants.
Flexible placement of jobs across different resource types based on availability. This is important in cloud environments which have heterogeneous
resources such as different architectures (GPU or CPU models) and different provisioning modes (spot vs on-demand).
Support for autoscaled environments where resources can be provisioned on demand.
Plain Kubernetes doesn't address the above requirements. In normal circumstances, once a Job is created, the job-controller instantly creates the
pods and kube-scheduler continuously attempts to assign the pods to nodes. At scale, this situation can work the control plane to death. There is
also currently no good way to control at the job level which jobs should get which resources first, and no way to express order or fair sharing. The
current ResourceQuota model is not a good fit for these needs because quotas are enforced on resource creation, and there is no queueing of requests. The
intent of ResourceQuotas is to provide a built-in reliability mechanism with policies needed by admins to protect clusters from failing over.
In the Kubernetes ecosystem, there are several solutions for job scheduling. However, we found that these alternatives have one or more of the following problems:
They replace existing stable components of Kubernetes, like kube-scheduler or the job-controller. This is problematic not only from an operational point of view, but
also the duplication in the job APIs causes fragmentation of the ecosystem and reduces portability.
They don't integrate with autoscaling, or
They lack support for resource flexibility.
How Kueue works
With Kueue we decided to take a different approach to job queueing on Kubernetes that is anchored around the following aspects:
Not duplicating existing functionalities already offered by established Kubernetes components for pod scheduling, autoscaling and job
lifecycle management.
Adding key features that are missing to existing components. For example, we invested in the Job API to cover more use cases like
IndexedJob and fixed long standing issues related to pod
tracking. While this path takes longer to
land features, we believe it is the more sustainable long term solution.
Ensuring compatibility with cloud environments where compute resources are elastic and heterogeneous.
For this approach to be feasible, Kueue needs knobs to influence the behavior of those established components so it can effectively manage
when and where to start a job. We added those knobs to the Job API in the form of two features:
Suspend field, which allows Kueue to signal to the job-controller
when to start or stop a Job.
Mutable scheduling directives, which allows Kueue to
update a Job's .spec.template.spec.nodeSelector before starting the Job. This way, Kueue can control Pod placement while still
delegating to kube-scheduler the actual pod-to-node scheduling.
Note that any custom job API can be managed by Kueue if that API offers the above two capabilities.
Resource model
Kueue defines new APIs to address the requirements mentioned at the beginning of this post. The three main APIs are:
ResourceFlavor: a cluster-scoped API to define a resource flavor available for consumption, like a GPU model. At its core, a ResourceFlavor is
a set of labels that mirrors the labels on the nodes that offer those resources.
ClusterQueue: a cluster-scoped API to define resource pools by setting quotas for one or more ResourceFlavor.
LocalQueue: a namespaced API for grouping and managing single tenant jobs. In its simplest form, a LocalQueue is a pointer to the ClusterQueue
that the tenant (modeled as a namespace) can use to start their jobs.
For more details, take a look at the API concepts documentation. While the three APIs may look overwhelming,
most of Kueue’s operations are centered around ClusterQueue; the ResourceFlavor and LocalQueue APIs are mainly organizational wrappers.
Example use case
Imagine the following setup for running batch workloads on a Kubernetes cluster on the cloud:
You have cluster-autoscaler installed in the cluster to automatically
adjust the size of your cluster.
There are two types of autoscaled node groups that differ on their provisioning policies: spot and on-demand. The nodes of each group are
differentiated by the label instance-type=spot or instance-type=ondemand.
Moreover, since not all Jobs can tolerate running on spot nodes, the nodes are tainted with spot=true:NoSchedule.
To strike a balance between cost and resource availability, imagine you want Jobs to use up to 1000 cores of on-demand nodes, then use up to
2000 cores of spot nodes.
As an admin for the batch system, you define two ResourceFlavors that represent the two types of nodes:
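A sketch of what that could look like (this assumes the Kueue API of that era, kueue.x-k8s.io/v1alpha2, and invented object names; the schema has evolved since, so check the current Kueue documentation):
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ResourceFlavor
metadata:
  name: ondemand
labels:
  instance-type: ondemand
---
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ResourceFlavor
metadata:
  name: spot
labels:
  instance-type: spot
taints:
- key: spot
  value: "true"
  effect: NoSchedule
You then define a ClusterQueue that sets the quotas for both flavors, on-demand first (again a sketch under the same assumptions):
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}
  resources:
  - name: "cpu"
    flavors:
    - name: ondemand
      quota:
        min: 1000
    - name: spot
      quota:
        min: 2000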
Note that the order of flavors in the ClusterQueue resources matters: Kueue will attempt to fit jobs in the available quotas according to
the order unless the job has an explicit affinity to specific flavors.
For each namespace, you define a LocalQueue that points to the ClusterQueue above:
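A sketch, again using the v1alpha2 schema and invented names:
apiVersion: kueue.x-k8s.io/v1alpha2
kind: LocalQueue
metadata:
  namespace: my-namespace
  name: main-queue
spec:
  clusterQueue: cluster-queue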
Admins create the above setup once. Batch users are able to find the queues they are allowed to
submit to by listing the LocalQueues in their namespace(s). The command is similar to the following: kubectl get -n my-namespace localqueues
To submit work, create a Job and set the kueue.x-k8s.io/queue-name annotation as follows:
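For example (a sketch; the image, names, and resource requests are illustrative, and the toleration matches the spot taint defined earlier):
apiVersion: batch/v1
kind: Job
metadata:
  namespace: my-namespace
  generateName: sample-job-
  annotations:
    kueue.x-k8s.io/queue-name: main-queue
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      restartPolicy: Never
      tolerations:
      - key: spot
        value: "true"
        effect: NoSchedule
      containers:
      - name: main
        image: busybox
        command: ["sleep", "30"]
        resources:
          requests:
            cpu: "1"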
Kueue intervenes to suspend the Job as soon as it is created. Once the Job is at the head of the ClusterQueue, Kueue evaluates if it can start
by checking if the resources requested by the job fit the available quota.
In the above example, the Job tolerates spot resources. If there are previously admitted Jobs consuming all existing on-demand quota but
not all of spot’s, Kueue admits the Job using the spot quota. Kueue does this by issuing a single update to the Job object that:
Changes the .spec.suspend flag to false
Adds the term instance-type: spot to the job's .spec.template.spec.nodeSelector so that when the pods are created by the job controller, those pods can only schedule
onto spot nodes.
Finally, if there are available empty nodes with matching node selector terms, then kube-scheduler will directly schedule the pods. If not, then
kube-scheduler will initially mark the pods as unschedulable, which will trigger the cluster-autoscaler to provision new nodes.
Future work and getting involved
The example above offers a glimpse of some of Kueue's features including support for quota, resource flexibility, and integration with cluster
autoscaler. Kueue also supports fair-sharing, job priorities, and different queueing strategies. Take a look at the
Kueue documentation to learn more about those features and how to use Kueue.
We have a number of features that we plan to add to Kueue, such as hierarchical quota, budgets, and support for dynamically sized jobs. In
the more immediate future, we are focused on adding support for job preemption.
The latest Kueue release is available on GitHub;
try it out if you run batch workloads on Kubernetes (requires v1.22 or newer).
We are in the early stages of this project and we are seeking feedback of all levels, major or minor, so please don’t hesitate to reach out. We’re
also open to additional contributors, whether it is to fix or report bugs, or help add new features or write documentation. You can get in touch with
us via our repo, mailing list or on
Slack.
Last but not least, thanks to all our contributors who made this project possible!
Authors: Rodrigo Campos (Microsoft), Giuseppe Scrivano (Red Hat)
Kubernetes v1.25 introduces support for user namespaces.
This is a major improvement for running secure workloads in
Kubernetes. Each pod will have access only to a limited subset of the
available UIDs and GIDs on the system, thus adding a new security
layer to protect from other pods running on the same system.
How does it work?
A process running on Linux can use up to 4294967296 (2^32) different UIDs and
GIDs.
User namespaces are a Linux feature that allows mapping a set of users
in the container to different users in the host, thus restricting which
IDs a process can effectively use.
Furthermore, the capabilities granted in a new user namespace do not
apply in the host initial namespaces.
Why is it important?
There are mainly two reasons why user namespaces are important:
improve security since they restrict the IDs a pod can use, so each
pod can run in its own separate environment with unique IDs.
enable running workloads as root in a safer manner.
In a user namespace, we can map the root user inside the pod to a
non-zero ID outside the container; the container believes it is running as
root, while from the host point of view it is a regular unprivileged ID.
The process can keep capabilities that are usually restricted to
privileged pods and do it in a safe way since the capabilities granted
in a new user namespace do not apply in the host initial namespaces.
How do I enable user namespaces?
At the moment, user namespaces support is opt-in, so you must enable
it for a pod by setting hostUsers to false in the pod spec stanza:
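For example (a minimal sketch; the pod name and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: userns-pod
spec:
  hostUsers: false
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian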
Immutable fields can be found in a few places in the built-in Kubernetes types.
For example, you can't change the .metadata.name of an object. Specific objects
have fields where changes to existing objects are constrained; for example, the
.spec.selector of a Deployment.
Aside from simple immutability, there are other common design patterns such as
lists which are append-only, or a map with mutable values and immutable keys.
Until recently the best way to restrict field mutability for CustomResourceDefinitions
has been to create a validating
admission webhook:
this means a lot of complexity for the common case of making a field immutable.
Beta since Kubernetes 1.25, CEL Validation Rules allow CRD authors to express
validation constraints on their fields using a rich expression language,
CEL. This article explores how you can
use validation rules to implement a few common immutability patterns directly in
the manifest for a CRD.
Basics of validation rules
The new support for CEL validation rules in Kubernetes allows CRD authors to add
complicated admission logic for their resources without writing any code!
For example, a CEL rule to constrain a field maximumSize to be greater than a
minimumSize for a CRD might look like the following:
rule: |
self.maximumSize > self.minimumSize
message: 'Maximum size must be greater than minimum size.'
The rule field contains an expression written in CEL. self is a special keyword
in CEL which refers to the object whose type contains the rule.
The message field is an error message which will be sent to Kubernetes clients
whenever this particular rule is not satisfied.
For more details about the capabilities and limitations of Validation Rules using
CEL, please refer to
validation rules.
The CEL specification is also a good
reference for information specifically related to the language.
Immutability patterns with CEL validation rules
This section implements several common use cases for immutability in Kubernetes
CustomResourceDefinitions, using validation rules expressed as
kubebuilder marker comments.
The resultant OpenAPI generated by the kubebuilder marker comments will also be
included, so that you can still follow along if you are writing your CRD
manifests by hand.
Project setup
To use CEL rules with kubebuilder comments, you first need to set up a Golang
project structure with the CRD defined in Go.
You may skip this step if you are not using kubebuilder or are only interested
in the resultant OpenAPI extensions.
Begin with a folder structure of a Go module set up like the following. If
you have your own project already set up feel free to adapt this tutorial to your liking:
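A sketch of the layout that the rest of this section assumes (adapt the paths to your own module):
cel-immutability-tutorial
├── generate.go
├── tools.go
└── pkg
    └── apis
        └── stable.example.com
            └── v1
                └── types.go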
types.go contains all type definitions in stable.example.com/v1
package v1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// An empty CRD as an example of defining a type using controller tools
// +kubebuilder:storageversion
// +kubebuilder:subresource:status
type TestCRD struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec TestCRDSpec `json:"spec,omitempty"`
Status TestCRDStatus `json:"status,omitempty"`
}
type TestCRDStatus struct {}
type TestCRDSpec struct {
// You will fill this in as you go along
}
tools.go contains a dependency on controller-gen which will be used to generate the CRD definition:
//go:build tools
package celimmutabilitytutorial
// Force direct dependency on code-generator so that it may be executed with go run
import (
_ "sigs.k8s.io/controller-tools/cmd/controller-gen"
)
Finally, generate.go contains a go:generate directive to make use of
controller-gen. controller-gen parses our types.go and generates
CRD yaml files into a crds folder:
package celimmutabilitytutorial
//go:generate go run sigs.k8s.io/controller-tools/cmd/controller-gen crd paths=./pkg/apis/... output:dir=./crds
You may now want to add dependencies for our definitions and test the code generation:
cd cel-immutability-tutorial
go mod init <your-org>/<your-module-name>
go mod tidy
go generate ./...
After running these commands you have completed the basic project structure,
and controller-gen has generated the CRD manifests into the crds folder.
The manifest for the example CRD is now available in crds/stable.example.com_testcrds.yaml.
Immutability after first modification
A common immutability design pattern is to make the field immutable once it has
been first set. This example will throw a validation error if the field
changes after being first initialized.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type ImmutableSinceFirstWrite struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
// +kubebuilder:validation:MaxLength=512
Value string `json:"value"`
}
The +kubebuilder directives in the comments inform controller-gen how to
annotate the generated OpenAPI. The XValidation marker causes the rule to appear
in the x-kubernetes-validations OpenAPI extension. Kubernetes then
enforces the constraints declared in the OpenAPI spec.
To enforce a field's immutability after its first write, you need to apply the following constraints:
The field must be allowed to be initially unset: +kubebuilder:validation:Optional
Once set, the field must not be allowed to be removed: !has(oldSelf.value) || has(self.value) (type-scoped rule)
Once set, the field must not be allowed to change value: self == oldSelf (field-scoped rule)
Also note the additional directive +kubebuilder:validation:MaxLength. CEL
requires that all strings have a declared maximum length so that it can estimate the
computation cost of the rule. Rules that are too expensive will be rejected.
For more information on CEL cost budgeting, check out the other tutorial.
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincefirstwrites.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincefirstwrites.stable.example.com created
Creating an initial empty object with no value is permitted, since value is optional. Once value has been set, however, an update that removes it is rejected by the type-scoped rule:
The ImmutableSinceFirstWrite "test1" is invalid: <nil>: Invalid value: "object": Value is required once set
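A sketch of a sequence that exercises these rules (the resource name is illustrative):
# 1. Create the object with no value: allowed, because value is optional
kubectl apply -f - <<EOF
apiVersion: stable.example.com/v1
kind: ImmutableSinceFirstWrite
metadata:
  name: test1
EOF

# 2. Set value for the first time: allowed
kubectl apply -f - <<EOF
apiVersion: stable.example.com/v1
kind: ImmutableSinceFirstWrite
metadata:
  name: test1
value: first-write
EOF

# 3. Applying a config without value (removing it) now fails with
#    "Value is required once set"; changing it fails with "Value is immutable".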
Generated schema
Note that in the generated schema there are two separate rule locations.
One is attached directly to the value property.
The other rule is associated with the CRD type itself.
openAPIV3Schema:
properties:
value:
maxLength: 512
type: string
x-kubernetes-validations:
- message: Value is immutable
rule: self == oldSelf
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.value) || has(self.value)'
Immutability upon object creation
A field which is immutable upon creation is implemented similarly to the
earlier example. The difference is that the field is marked required, and the
type-scoped rule is no longer necessary.
type ImmutableSinceCreation struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Required
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
// +kubebuilder:validation:MaxLength=512
Value string `json:"value"`
}
This field will be required when the object is created, and after that point it will
not be allowed to be modified. The CEL validation rule self == oldSelf enforces exactly this.
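A custom resource for this type might look like the following sketch (the name and value are illustrative):
apiVersion: stable.example.com/v1
kind: ImmutableSinceCreation
metadata:
  name: test1
value: fixed-at-creation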
Usage example
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincecreations.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincecreations.stable.example.com created
Applying an object without the required field should fail:
The ImmutableSinceCreation "test1" is invalid:
* value: Required value
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation
Now that the field has been added, the operation is permitted:
immutablesincecreation.stable.example.com/test1 created
If you attempt to change the value, the operation is blocked due to the
validation rules in the CRD. Note that the error message is as it was defined
in the validation rule.
The ImmutableSinceCreation "test1" is invalid: value: Invalid value: "string": Value is immutable
Generated schema
openAPIV3Schema:
properties:
value:
maxLength: 512
type: string
x-kubernetes-validations:
- message: Value is immutable
rule: self == oldSelf
required:
- value
type: object
Append-only list of containers
In the case of ephemeral containers on Pods, Kubernetes enforces that the
elements in the list are immutable, and can’t be removed. The following example
shows how you could use CEL to achieve the same behavior.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type AppendOnlyList struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:MaxItems=100
// +kubebuilder:validation:XValidation:rule="oldSelf.all(x, x in self)",message="Values may only be added"
Values []v1.EphemeralContainer `json:"value"`
}
Once set, field must not be deleted: !has(oldSelf.value) || has(self.value) (type-scoped)
Once a value is added it is not removed: oldSelf.all(x, x in self) (field-scoped)
Value may be initially unset: +kubebuilder:validation:Optional
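(In the Go snippet above, v1 refers to k8s.io/api/core/v1, which must be added to the import block in types.go.) A conforming object might start out like this sketch (the container details are illustrative); later updates may append entries to value, but existing entries can be neither removed nor modified:
apiVersion: stable.example.com/v1
kind: AppendOnlyList
metadata:
  name: testlist
value:
- name: debugger
  image: busybox:1.36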
Note that for cost-budgeting purposes, MaxItems is also required to be specified.
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_appendonlylists.yaml
customresourcedefinition.apiextensions.k8s.io/appendonlylists.stable.example.com created
Creating an initial list with one element inside should succeed without problem. Removing the entire value field afterwards, however, is blocked by the type-scoped rule:
The AppendOnlyList "testlist" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
openAPIV3Schema:
properties:
value:
items: ...
maxItems: 100
type: array
x-kubernetes-validations:
- message: Values may only be added
rule: oldSelf.all(x, x in self)
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.value) || has(self.value)'
Map with append-only keys, immutable values
// A map which does not allow keys to be removed or their values changed once set. New keys may be added, however.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.values) || has(self.values)", message="Value is required once set"
type MapAppendOnlyKeys struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:MaxProperties=10
// +kubebuilder:validation:XValidation:rule="oldSelf.all(key, key in self && self[key] == oldSelf[key])",message="Keys may not be removed and their values must stay the same"
Values map[string]string `json:"values,omitempty"`
}
Once set, field must not be deleted: !has(oldSelf.values) || has(self.values) (type-scoped)
Once a key is added it is not removed nor is its value modified: oldSelf.all(key, key in self && self[key] == oldSelf[key]) (field-scoped)
Value may be initially unset: +kubebuilder:validation:Optional
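A conforming object might start out like this sketch (the key and value are illustrative); later updates may add new keys, but key1 can neither be removed nor have its value changed:
apiVersion: stable.example.com/v1
kind: MapAppendOnlyKeys
metadata:
  name: testmap
values:
  key1: value1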
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_mapappendonlykeys.yaml
customresourcedefinition.apiextensions.k8s.io/mapappendonlykeys.stable.example.com created
Creating an initial object with one key within values is permitted. Removing that key, or changing its value, is then blocked by the field-scoped rule:
The MapAppendOnlyKeys "testmap" is invalid: values: Invalid value: "object": Keys may not be removed and their values must stay the same
If the entire field is removed, the other validation rule is triggered and the
operation is prevented. Note that the error message for the validation rule is
shown to the user.
The MapAppendOnlyKeys "testmap" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
openAPIV3Schema:
description: A map which does not allow keys to be removed or their values
changed once set. New keys may be added, however.
properties:
values:
additionalProperties:
type: string
maxProperties: 10
type: object
x-kubernetes-validations:
- message: Keys may not be removed and their values must stay the same
rule: oldSelf.all(key, key in self && self[key] == oldSelf[key])
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.values) || has(self.values)'
Going further
The above examples showed how CEL rules can be added to kubebuilder types.
The same rules can be added directly to OpenAPI if writing a manifest for a CRD by hand.
For native types, the same behavior can be achieved using kube-openapi’s marker
+validations.
Usage of CEL within Kubernetes validation rules is far more powerful than
what has been shown in this article. For more information please check out
validation rules
in the Kubernetes documentation and CRD Validation Rules Beta blog post.
The Kubernetes in-tree storage plugin to Container Storage Interface (CSI) migration infrastructure was introduced as alpha in Kubernetes v1.14 and has been beta since v1.17.
Since then, SIG Storage and other Kubernetes special interest groups have been working to ensure feature stability and compatibility in preparation for the CSI Migration feature to go GA.
SIG Storage is excited to announce that the core CSI Migration feature is generally available in the Kubernetes v1.25 release!
SIG Storage wrote a blog post for v1.23 that discussed the CSI migration status of each storage driver. It has been a while, and this article is intended to give the latest status update on each storage driver's CSI Migration status in Kubernetes v1.25.
Quick recap: What is CSI Migration, and why migrate?
The Container Storage Interface (CSI) was designed to help Kubernetes replace its existing, in-tree storage driver mechanisms, especially vendor-specific plugins.
Kubernetes support for the Container Storage Interface has been
generally available since Kubernetes v1.13.
Support for using CSI drivers was introduced to make it easier to add and maintain new integrations between Kubernetes and storage backend technologies. Using CSI drivers allows for better maintainability (driver authors can define their own release cycle and support lifecycle) and reduces the opportunity for vulnerabilities (with less in-tree code, the risks of a mistake are reduced, and cluster operators can select only the storage drivers that their cluster requires).
As more CSI drivers were created and became production ready, SIG Storage wanted all Kubernetes users to benefit from the CSI model. However, we could not break API compatibility with the existing storage API types due to Kubernetes architecture conventions. The solution we came up with was CSI migration: a feature that translates in-tree APIs to equivalent CSI APIs and delegates operations to a replacement CSI driver.
The CSI migration effort enables the replacement of existing in-tree storage plugins such as kubernetes.io/gce-pd or kubernetes.io/aws-ebs with a corresponding CSI driver from the storage backend.
If CSI Migration is working properly, Kubernetes end users shouldn’t notice a difference. Existing StorageClass, PersistentVolume and PersistentVolumeClaim objects should continue to work.
When a Kubernetes cluster administrator updates a cluster to enable CSI migration, existing workloads that utilize PVCs which are backed by in-tree storage plugins will continue to function as they always have.
However, behind the scenes, Kubernetes hands control of all storage management operations (previously targeting in-tree drivers) to CSI drivers.
For example, suppose you are a kubernetes.io/gce-pd user; after CSI migration, you can still use kubernetes.io/gce-pd to provision new volumes, mount existing GCE-PD volumes or delete existing volumes. All existing APIs and interfaces will still function correctly. However, the underlying function calls all go through the GCE PD CSI driver instead of the in-tree Kubernetes functions.
This enables a smooth transition for end users. Additionally, as storage plugin developers, we can reduce the burden of maintaining the in-tree storage plugins and eventually remove them from the core Kubernetes binary.
What is the timeline / status?
The current and targeted releases for each individual driver are shown in the table below:

| Driver | Alpha | Beta (in-tree deprecated) | Beta (on-by-default) | GA | Target "in-tree plugin" removal |
| --- | --- | --- | --- | --- | --- |
| AWS EBS | 1.14 | 1.17 | 1.23 | 1.25 | 1.27 (Target) |
| Azure Disk | 1.15 | 1.19 | 1.23 | 1.24 | 1.26 (Target) |
| Azure File | 1.15 | 1.21 | 1.24 | 1.26 (Target) | 1.28 (Target) |
| Ceph FS | 1.26 (Target) | | | | |
| Ceph RBD | 1.23 | 1.26 (Target) | 1.27 (Target) | 1.28 (Target) | 1.30 (Target) |
| GCE PD | 1.14 | 1.17 | 1.23 | 1.25 | 1.27 (Target) |
| OpenStack Cinder | 1.14 | 1.18 | 1.21 | 1.24 | 1.26 (Target) |
| Portworx | 1.23 | 1.25 | 1.26 (Target) | 1.27 (Target) | 1.29 (Target) |
| vSphere | 1.18 | 1.19 | 1.25 | 1.26 (Target) | 1.28 (Target) |
The following storage drivers will not have CSI migration support.
The scaleio, flocker, quobyte and storageos drivers were removed; the others are deprecated and will be removed from core Kubernetes in the coming releases.
| Driver | Deprecated | Code Removal |
| --- | --- | --- |
| Flocker | 1.22 | 1.25 |
| GlusterFS | 1.25 | 1.26 (Target) |
| Quobyte | 1.22 | 1.25 |
| ScaleIO | 1.16 | 1.22 |
| StorageOS | 1.22 | 1.25 |
What does it mean for the core CSI Migration feature to go GA?
The core CSI Migration feature going GA means that the general framework, core library, and API for CSI migration are
stable in Kubernetes v1.25 and will be part of future Kubernetes releases as well.
If you are a Kubernetes distribution maintainer, this means that if you previously disabled the CSIMigration feature gate, you can no longer do so: the feature gate is locked to enabled.
If you are a Kubernetes storage driver developer, this means you can expect no backwards-incompatible changes in the CSI migration library.
If you are a Kubernetes maintainer, expect nothing to change in your day-to-day development flow.
If you are a Kubernetes user, expect nothing to change from your day-to-day usage flows. If you encounter any storage related issues, contact the people who operate your cluster (if that's you, contact the provider of your Kubernetes distribution, or get help from the community).
What does it mean for the storage driver CSI migration to go GA?
A storage driver's CSI Migration going GA means that the specific storage driver supports CSI Migration, and you can expect feature parity between the in-tree plugin and the CSI driver.
If you are a Kubernetes distribution maintainer, make sure you install the corresponding
CSI driver on the distribution, and make sure you do not disable the driver-specific CSIMigration{provider} feature gates, as they are locked.
If you are a Kubernetes storage driver maintainer, make sure the CSI driver provides feature parity with the in-tree plugin if it supports CSI migration.
If you are a Kubernetes maintainer/developer, expect nothing to change from your day-to-day development flows.
If you are a Kubernetes user, the CSI Migration feature should be completely transparent
to you, the only requirement is to install the corresponding CSI driver.
What's next?
We expect the removal of cloud provider in-tree storage plugin code to start as part of the v1.26 and v1.27 releases of Kubernetes. More and more drivers that support CSI migration will go GA in the upcoming releases.
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. We offer a huge thank you to the contributors who stepped up these last quarters to help move the project forward:
Xing Yang (xing-yang)
Hemant Kumar (gnufied)
Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CSI migration feature:
Andy Zhang (andyzhangz)
Divyen Patel (divyenpatel)
Deep Debroy (ddebroy)
Humble Devassy Chirammal (humblec)
Ismail Alidzhikov (ialidzhikov)
Jordan Liggitt (liggitt)
Matthew Cary (mattcary)
Matthew Wong (wongma7)
Neha Arora (nearora-msft)
Oksana Naumov (trierra)
Saad Ali (saad-ali)
Michelle Au (msau42)
Those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.
Validation rules make it possible to declare how custom resources are validated using the Common Expression Language (CEL). For example:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
...
openAPIV3Schema:
type: object
properties:
spec:
type: object
x-kubernetes-validations:
- rule: "self.minReplicas <= self.replicas && self.replicas <= self.maxReplicas"
message: "replicas should be in the range minReplicas..maxReplicas."
properties:
replicas:
type: integer
...
Validation rules support a wide range of use cases. To get a sense of some of the capabilities, let's look at a few examples:
| Validation Rule | Purpose |
| --- | --- |
| self.minReplicas <= self.replicas | Validate an integer field is less than or equal to another integer field |
| 'Available' in self.stateCounts | Validate an entry with the 'Available' key exists in a map |
| self.set1.all(e, !(e in self.set2)) | Validate that the elements of two sets are disjoint |
| self == oldSelf | Validate that a required field is immutable once it is set |
| self.created + self.ttl < self.expired | Validate that 'expired' date is after a 'create' date plus a 'ttl' duration |
Validation rules are expressive and flexible. See the Validation Rules documentation to learn more about what validation rules are capable of.
Why CEL?
CEL was chosen as the language for validation rules for a couple reasons:
CEL expressions can easily be inlined into CRD schemas. They are sufficiently expressive to replace the vast majority of CRD validation checks currently implemented in admission webhooks. This results in CRDs that are self-contained and are easier to understand.
CEL expressions are compiled and type checked against a CRD's schema "ahead-of-time" (when CRDs are created and updated), allowing them to be evaluated efficiently and safely at "runtime" (when custom resources are validated). Even regex string literals in CEL are validated and pre-compiled when CRDs are created or updated.
Why not use validation webhooks?
Benefits of using validation rules when compared with validation webhooks:
CRD authors benefit from a simpler workflow since validation rules eliminate the need to develop and maintain a webhook.
Cluster administrators benefit by no longer having to install, upgrade and operate webhooks for the purposes of CRD validation.
Cluster operability improves because CRD validation no longer requires a remote call to a webhook endpoint, eliminating a potential point of failure in the request-serving-path of the Kubernetes API server. This allows clusters to retain high availability while scaling to larger numbers of installed CRD extensions, since expected control plane availability would otherwise decrease with each additional webhook installed.
Getting started with validation rules
Writing validation rules in OpenAPIv3 schemas
You can define validation rules for any level of a CRD's OpenAPIv3 schema. Validation rules are automatically scoped to their location in the schema where they are declared.
Good practices for CRD validation rules:
Scope validation rules as close as possible to the field(s) they validate.
Use multiple rules when validating independent constraints.
Do not use validation rules for checks that the OpenAPIv3 schema can already express:
Use OpenAPIv3 value validations (maxLength, maxItems, maxProperties, required, enum, minimum, maximum, ...) and string formats where available.
Use x-kubernetes-int-or-string, x-kubernetes-embedded-type and x-kubernetes-list-type=(set|map) where appropriate.
Examples of good practice:
| Validation | Best Practice | Example(s) |
| --- | --- | --- |
| Validate an integer is between 0 and 100. | Use OpenAPIv3 value validations. | type: integer, minimum: 0, maximum: 100 |
| Constrain the max size limits on maps (objects with additionalProperties), arrays and strings. | Use OpenAPIv3 value validations. Recommended for all maps, arrays and strings. This best practice is essential for rule cost estimation (explained below). | type: array, maxItems: 100 |
| Require a date-time be more recent than a particular timestamp. | Use OpenAPIv3 string formats to declare that the field is a date-time. Use validation rules to compare it to a particular timestamp. | ... |
| Validate that the elements of two sets are disjoint. | Use x-kubernetes-list-type to validate that the arrays are sets. Use validation rules to validate the sets are disjoint. | type: object, properties: set1 (type: array, x-kubernetes-list-type: set), set2: ..., x-kubernetes-validations: - rule: "self.set1.all(e, !(e in self.set2))" |
CRD transition rules
Transition Rules make it possible to compare the new state against the old state of a resource in validation rules. You use transition rules to make sure that the cluster's API server does not accept invalid state transitions. A transition rule is a validation rule that references 'oldSelf'. The API server only evaluates transition rules when both an old value and new value exist.
Transition rule examples:
| Transition Rule | Purpose |
| --- | --- |
| self == oldSelf | For a required field, make that field immutable once it is set. For an optional field, only allow transitioning from unset to set, or from set to unset. |
| (on parent of field) has(self.field) == has(oldSelf.field); on field: self == oldSelf | Make a field immutable: validate that a field, even if optional, never changes after the resource is created (for a required field, the previous rule is simpler). |
| self.all(x, x in oldSelf) | Only allow adding items to a field that represents a set (prevent removals). |
| self >= oldSelf | Validate that a number is monotonically increasing. |
Using the Functions Libraries
Validation rules have access to a few different function libraries. Some examples:

| Validation Rule | Purpose |
| --- | --- |
| isURL(self) && url(self).getHostname() in ['a.example.com', 'b.example.com'] | Validate that a URL has an allowed hostname. |
| self.map(x, x.weight).sum() == 1 | Validate that the weights of a list of objects sum to 1. |
| int(self.find('^[0-9]*')) < 100 | Validate that a string starts with a number less than 100. |
| self.isSorted() | Validate that a list is sorted. |
Resource use and limits
To prevent CEL evaluation from consuming excessive compute resources, validation rules impose some limits. These limits are based on CEL cost units, a platform and machine independent measure of execution cost. As a result, the limits are the same regardless of where they are enforced.
Estimated cost limit
CEL is, by design, non-Turing-complete in such a way that the halting problem isn’t a concern. CEL takes advantage of this design choice to include an "estimated cost" subsystem that can statically compute the worst case run time cost of any CEL expression. Validation rules are integrated with the estimated cost system and disallow CEL expressions from being included in CRDs if they have a sufficiently poor (high) estimated cost. The estimated cost limit is set quite high and typically requires an O(n^2) or worse operation, across something of unbounded size, to be exceeded. Fortunately the fix is usually quite simple: because the cost system is aware of size limits declared in the CRD's schema, CRD authors can add size limits to the CRD's schema (maxItems for arrays, maxProperties for maps, maxLength for strings) to reduce the estimated cost.
Good practice:
Set maxItems, maxProperties and maxLength on all array, map (object with additionalProperties) and string types in CRD schemas! This results in lower and more accurate estimated costs and generally makes a CRD safer to use.
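For example, a sketch of a schema fragment with bounds declared (the fields and limits are illustrative):
properties:
  hostname:
    type: string
    maxLength: 253
  tags:
    type: array
    maxItems: 20
    items:
      type: string
      maxLength: 63
With these bounds in place, the cost estimator can assume a rule over tags iterates at most 20 strings of at most 63 characters each.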
Runtime cost limits for CRD validation rules
In addition to the estimated cost limit, CEL keeps track of actual cost while evaluating a CEL expression and will halt execution of the expression if a limit is exceeded.
With the estimated cost limit already in place, the runtime cost limit is rarely encountered. But it is possible. For example, it might be encountered for a large resource composed entirely of a single large list and a validation rule that is either evaluated on each element in the list, or traverses the entire list.
CRD authors can ensure the runtime cost limit will not be exceeded in much the same way the estimated cost limit is avoided: by setting maxItems, maxProperties and maxLength on array, map and string types.
Future work
We look forward to working with the community on the adoption of CRD Validation Rules, and hope to see this feature promoted to general availability in an upcoming Kubernetes release!
There is a growing community of Kubernetes contributors thinking about how to make it possible to write extensible admission controllers using CEL as a substitute for admission webhooks for policy enforcement use cases. Anyone interested should reach out to us on the usual SIG API Machinery channels or via slack at #sig-api-machinery-cel-dev.
Acknowledgements
Special thanks to Cici Huang, Ben Luddy, Jordan Liggitt, David Eads, Daniel Smith, Dr. Stefan Schimanski, Leila Jalali and everyone who contributed to Validation Rules!
Author: Humble Chirammal (Red Hat), Louis Koo (deeproute.ai)
Kubernetes v1.25, released earlier this month, introduced a new feature
that lets your cluster expand storage volumes, even when access to those
volumes requires a secret (for example: a credential for accessing a SAN fabric)
to perform a node expand operation. This new behavior is in alpha and you
must enable a feature gate (CSINodeExpandSecret) to make use of it.
You must also be using CSI
storage; this change isn't relevant to storage drivers that are built in to Kubernetes.
To turn on this new alpha feature, you enable the CSINodeExpandSecret feature
gate for the kube-apiserver and kubelet. Doing so turns on a mechanism that sends a secretRef
as part of the NodeExpandVolume request, so that CSI drivers can use those
credentials when performing the node-side expansion operation against the underlying
storage system.
What is this all about?
Before Kubernetes v1.24, you were able to define a cluster-level StorageClass
that made use of StorageClass Secrets,
but you didn't have any mechanism to specify the credentials to be used for
operations that take place when the storage is mounted onto a node or when
the volume has to be expanded at the node side.
Kubernetes CSI support already implemented a similar mechanism for specific kinds of
volume resizes; namely, resizes of PersistentVolumes where the resize takes place
independently of any node, referred to as controller expansion. In that case, you
associate a PersistentVolume with a Secret that contains credentials for volume resize
actions, so that controller expansion can take place. CSI also supports a nodeExpandVolume
operation, which CSI drivers can use independently of, or along with, controller
expansion, and where the resize is driven from a node in your cluster where the volume
is attached. Please read Kubernetes 1.24: Volume Expansion Now A Stable Feature to learn more.
At times, the CSI driver needs to check the actual size of the backend block storage (or image)
before proceeding with a node-level filesystem expand operation. This avoids false positive returns
from the backend storage cluster during filesystem expands.
When a PersistentVolume represents encrypted block storage (for example using LUKS)
you need to provide a passphrase in order to expand the device, and also to make it possible
to grow the filesystem on that device.
For various validations at the time of node expansion, the CSI driver has to be connected
to the backend storage cluster. If the nodeExpandVolume request includes a secretRef,
the CSI driver can use those credentials to connect to the storage cluster and
perform the required operations.
How does it work?
To enable this functionality in this version of Kubernetes, SIG Storage has introduced
a new feature gate called CSINodeExpandSecret. Once the feature gate is enabled
in the cluster, NodeExpandVolume requests can include a secretRef field. The NodeExpandVolume request
is part of CSI; for example, it appears in a request sent from the Kubernetes
control plane to the CSI driver.
As a cluster operator, you can specify these secrets as an opaque parameter in a StorageClass,
the same way that you can already specify other CSI secret data. The StorageClass needs to have some
CSI-specific parameters set. Here's an example of those parameters:
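(The secret name and namespace here are placeholders; the same keys appear in the full StorageClass manifest below.)
parameters:
  csi.storage.k8s.io/node-expand-secret-name: test-secret
  csi.storage.k8s.io/node-expand-secret-namespace: default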
If the feature gate is enabled and the StorageClass carries the above secret configuration,
the CSI provisioner receives the credentials from the Secret as part of the NodeExpansion request.
CSI volumes that require secrets for online expansion will have the nodeExpandSecretRef
field set. If it is not set, the NodeExpandVolume CSI RPC call will be made without a secret.
Trying it out
Enable the CSINodeExpandSecret feature gate (please refer to
Feature Gates).
Create a Secret, and then a StorageClass that uses that Secret.
Here's an example manifest for a Secret that holds credentials:
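This is a sketch; the data keys (username, password) are placeholders for whatever credentials your CSI driver expects:
apiVersion: v1
kind: Secret
metadata:
  name: test-secret      # referenced by the StorageClass below
  namespace: default
stringData:
  username: admin        # placeholder credential
  password: t0p-Secret   # placeholder credential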
Here's an example manifest for a StorageClass that refers to those credentials:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: csi-blockstorage-sc
parameters:
csi.storage.k8s.io/node-expand-secret-name: test-secret # the name of the Secret
csi.storage.k8s.io/node-expand-secret-namespace: default # the namespace that the Secret is in
provisioner: blockstorage.cloudprovider.example
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
Example output
If the PersistentVolumeClaim (PVC) was created successfully, you can see that
configuration within the spec.csi field of the PersistentVolume (look for
spec.csi.nodeExpandSecretRef).
Check that it worked by running kubectl get persistentvolume <pv_name> -o yaml.
You should see something like the following.
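A sketch of the relevant part (the driver name matches the example StorageClass above; the volume handle is a placeholder):
spec:
  csi:
    driver: blockstorage.cloudprovider.example
    volumeHandle: e21c7809-aabb-11ec-b909-0242ac120002   # placeholder
    nodeExpandSecretRef:
      name: test-secret
      namespace: default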
If you then trigger online storage expansion, the kubelet passes the appropriate credentials
to the CSI driver, by loading that Secret and passing the data to the storage driver.
As this feature is still in alpha, the Kubernetes Storage SIG expects to refine it based on
feedback from CSI driver authors, with more tests and implementations. The community plans to eventually
promote the feature to beta in upcoming releases.
Get involved or learn more?
The enhancement proposal includes lots of detail about the history and technical
implementation of this feature.
Please get involved by joining the Kubernetes
Storage SIG
(Special Interest Group) to help us enhance this feature.
There are a lot of good ideas already and we'd be thrilled to have more!
Local ephemeral storage capacity isolation was introduced as an alpha feature in Kubernetes 1.7 and went beta in 1.9. With Kubernetes 1.25 we are excited to announce general availability (GA) of this feature.
Pods use ephemeral local storage for scratch space, caching, and logs. The lifetime of local ephemeral storage does not extend beyond the life of the individual pod. It is exposed to pods using the container’s writable layer, logs directory, and EmptyDir volumes. Before this feature was introduced, there were issues related to the lack of local storage accounting and isolation, such as Pods not knowing how much local storage is available and being unable to request guaranteed local storage. Local storage is a best-effort resource and pods can be evicted due to other pods filling the local storage.
The local storage capacity isolation feature allows users to manage local ephemeral storage in the same way as managing CPU and memory. It provides support for capacity isolation of shared storage between pods, such that a pod can be hard limited in its consumption of shared resources by evicting Pods if its consumption of shared storage exceeds that limit. It also allows setting ephemeral storage requests for resource reservation. The limits and requests for shared ephemeral-storage are similar to those for memory and CPU consumption.
How to use local storage capacity isolation
A typical configuration for local ephemeral storage is to place all different kinds of ephemeral local data (emptyDir volumes, writeable layers, container images, logs) into one filesystem. Typically, both /var/lib/kubelet and /var/log are on the system's root filesystem. If users configure the local storage in different ways, kubelet might not be able to correctly measure disk usage and use this feature.
Setting requests and limits for local ephemeral storage
You can specify ephemeral-storage for managing local ephemeral storage. Each container of a Pod can specify either or both of the following:
spec.containers[].resources.requests.ephemeral-storage
spec.containers[].resources.limits.ephemeral-storage
In the following example, the Pod has two containers. The first container has a request of 8GiB of local ephemeral storage and a limit of 12GiB. The second container requests 2GiB of local storage but sets no limit. Therefore, the Pod requests a total of 10GiB (8GiB + 2GiB) of local ephemeral storage and enforces a limit of 12GiB of local ephemeral storage. It also sets the emptyDir sizeLimit to 5GiB. This setting in the pod spec affects how the scheduler makes scheduling decisions and how the kubelet evicts pods.
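A sketch of such a Pod manifest (the image names are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: images.example/app:v4        # placeholder image
    resources:
      requests:
        ephemeral-storage: "8Gi"
      limits:
        ephemeral-storage: "12Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: /tmp
  - name: log-aggregator
    image: images.example/log-agg:v6    # placeholder image
    resources:
      requests:
        ephemeral-storage: "2Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: /tmp
  volumes:
  - name: ephemeral
    emptyDir:
      sizeLimit: 5Gi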
First of all, the scheduler ensures that the sum of the resource requests of the scheduled containers is less than the capacity of the node. In this case, the pod can be assigned to a node only if the node's available ephemeral storage (allocatable resource) is more than 10GiB.
Secondly, at the container level, since the first container sets a resource limit, the kubelet eviction manager measures the disk usage of this container and evicts the pod if the storage usage of that container exceeds its limit (12GiB). At the pod level, the kubelet works out an overall Pod storage limit by adding up the limits of all the containers in that Pod. In this case, the total storage usage at the pod level is the sum of the disk usage from all containers plus the Pod's emptyDir volumes. If this total usage exceeds the overall Pod storage limit (12GiB), then the kubelet also marks the Pod for eviction.
Last, in this example, the emptyDir volume sets its sizeLimit to 5Gi. This means that if this pod's emptyDir uses more than 5GiB of local storage, the pod is evicted from the node.
Setting resource quota and limitRange for local ephemeral storage
This feature adds two more resource quotas for storage. The requests.ephemeral-storage and limits.ephemeral-storage quotas constrain the total ephemeral storage requests and limits, respectively, of all containers in a namespace, as shown in the sketch below.
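A sketch of such a ResourceQuota (the values are illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-storage-quota
spec:
  hard:
    requests.ephemeral-storage: 50Gi
    limits.ephemeral-storage: 100Gi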
Similar to CPU and memory, an admin can use a LimitRange to set a default container local storage request/limit, and/or minimum/maximum resource constraints for a namespace.
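A sketch of such a LimitRange (the values are illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limits
spec:
  limits:
  - type: Container
    defaultRequest:
      ephemeral-storage: 512Mi   # default request for containers that set none
    default:
      ephemeral-storage: 1Gi     # default limit for containers that set none
    min:
      ephemeral-storage: 100Mi
    max:
      ephemeral-storage: 4Gi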
Also, ephemeral-storage may be reserved for the kubelet or the system. For example: --system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=10Gi][,][pid=1000] --kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=5Gi][,][pid=1000]. If your cluster node's root disk capacity is 100Gi, then after setting these system-reserved and kube-reserved values, the available allocatable ephemeral storage becomes 85Gi. The scheduler uses this information to assign pods based on requests and the allocatable resources of each node. The eviction manager also uses allocatable resources to determine pod eviction. See more details in Reserve Compute Resources for System Daemons.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together.
We offer a huge thank you to all the contributors in Kubernetes Storage SIG and CSI community who helped review the design and implementation of the project, including but not limited to the following: