Blog: Kubernetes v1.28: Planternetes


Authors: Kubernetes v1.28 Release Team

Announcing the release of Kubernetes v1.28 Planternetes, the second release of 2023!

This release consists of 45 enhancements. Of those enhancements, 19 are entering Alpha, 14 have graduated to Beta, and 12 have graduated to Stable.

Release Theme And Logo

Kubernetes v1.28: Planternetes

The theme for Kubernetes v1.28 is Planternetes.

Each Kubernetes release is the culmination of the hard work of thousands of individuals from our community. The people behind this release come from a wide range of backgrounds: some of us are industry veterans or parents, others are students and newcomers to open source. We combine our unique experience to create a collective artifact with global impact.

Much like a garden, our release has ever-changing growth, challenges and opportunities. This theme celebrates the meticulous care, intention and efforts to get the release to where we are today. Harmoniously together, we grow better.

What's New (Major Themes)

Changes to supported skew between control plane and node versions

This enables testing and expanding the supported skew between core node and control plane components by one version from n-2 to n-3, so that node components (kubelet and kube-proxy) for the oldest supported minor version work with control plane components (kube-apiserver, kube-scheduler, kube-controller-manager, cloud-controller-manager) for the newest supported minor version.

This is valuable for end users because control plane upgrades are usually a little faster than node upgrades, which almost always take longer when the nodes are running workloads.

The Kubernetes yearly support period already makes annual upgrades possible. Users can upgrade to the latest patch versions to pick up security fixes and do 3 sequential minor version upgrades once a year to "catch up" to the latest supported minor version.

However, since the tested/supported skew between nodes and control planes is currently limited to 2 versions, a 3-version upgrade would have to update nodes twice to stay within the supported skew.
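The effect on a once-a-year upgrade can be sketched in a few lines of Python. This is a toy model, not Kubernetes code: the function names and the planning strategy (step the control plane one minor version at a time and upgrade nodes only when the skew would be exceeded) are assumptions made for illustration. Only the skew limits (2 before, 3 now) come from the text above.

```python
def skew_ok(control_plane_minor: int, node_minor: int, max_skew: int = 3) -> bool:
    """Nodes may lag the control plane by at most max_skew minor versions, and never lead it."""
    return 0 <= control_plane_minor - node_minor <= max_skew


def node_upgrades_needed(start: int, target: int, max_skew: int) -> int:
    """Count node upgrades while the control plane steps one minor version at a time."""
    node = start
    upgrades = 0
    for control_plane in range(start + 1, target + 1):
        if not skew_ok(control_plane, node, max_skew):
            # Nodes must catch up to the current control plane version
            # before the control plane can take its next step.
            node = control_plane - 1
            upgrades += 1
    if node != target:
        # Final node upgrade to the target version.
        upgrades += 1
    return upgrades


# Going from minor version 25 to 28 in one maintenance window:
print(node_upgrades_needed(25, 28, max_skew=2))  # old n-2 policy
print(node_upgrades_needed(25, 28, max_skew=3))  # new n-3 policy
```

Under these assumptions the n-2 policy forces an intermediate node upgrade, while the n-3 policy needs only the final one.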

Generally available: recovery from non-graceful node shutdown

If a node shuts down unexpectedly or ends up in a non-recoverable state (perhaps due to hardware failure or an unresponsive OS), Kubernetes lets you clean up afterwards so that stateful workloads can restart on a different node. For Kubernetes v1.28, that's now a stable feature.

This allows stateful workloads to fail over to a different node successfully after the original node is shut down or is in a non-recoverable state, for example because of a hardware failure or a broken OS.

Versions of Kubernetes earlier than v1.20 lacked handling for node shutdown. On Linux, the kubelet now integrates with systemd and implements graceful node shutdown (beta, and enabled by default). However, even an intentional shutdown might not get handled well. That could be because:

the node runs Windows
the node runs Linux, but uses a different init (not systemd)
the shutdown does not trigger the system inhibitor locks mechanism
the node has a node-level configuration error (such as not setting appropriate values for shutdownGracePeriod and shutdownGracePeriodCriticalPods).

When a node shuts down or fails, and that shutdown was not detected by the kubelet, the pods that are part of a StatefulSet will be stuck in Terminating status on the shut-down node. If the stopped node restarts, the kubelet on that node can clean up (DELETE) the Pods that the Kubernetes API still sees as bound to that node. However, if the node stays stopped - or if the kubelet isn't able to start after a reboot - then Kubernetes may not be able to create replacement Pods. When the kubelet on the shut-down node is not available to delete the old pods, an associated StatefulSet cannot create a new pod (which would have the same name).

There's also a problem with storage. If there are volumes used by the pods, existing VolumeAttachments will not be disassociated from the original - and now shut-down - node, so the PersistentVolumes used by these pods cannot be attached to a different, healthy node. As a result, an application running on an affected StatefulSet may not be able to function properly. If the original, shut-down node does come up, then its pods will be deleted by its kubelet and new pods can be created on a different running node. If the original node does not come up (common with an immutable infrastructure design), those pods would be stuck in a Terminating status on the shut-down node forever.

For more information on how to trigger cleanup after a non-graceful node shutdown, read non-graceful node shutdown.
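As a rough illustration, cleanup is triggered by manually adding the node.kubernetes.io/out-of-service taint to the affected Node. The Python sketch below only builds a JSON merge-patch body for that; the helper name and the taint value "nodeshutdown" are illustrative choices, and sending the patch to the API server is left out.

```python
import json

# Taint key that marks a node as out of service.
OUT_OF_SERVICE_KEY = "node.kubernetes.io/out-of-service"


def out_of_service_patch(existing_taints):
    """Return a merge-patch body that adds the out-of-service taint to a Node.

    A JSON merge patch replaces the whole taints list, so the taints that
    are already on the node are carried over.
    """
    taint = {"key": OUT_OF_SERVICE_KEY, "value": "nodeshutdown", "effect": "NoExecute"}
    kept = [t for t in existing_taints if t.get("key") != OUT_OF_SERVICE_KEY]
    return {"spec": {"taints": kept + [taint]}}


# Hypothetical node that already carries one unrelated taint.
patch = out_of_service_patch([{"key": "example.com/dedicated", "effect": "NoSchedule"}])
print(json.dumps(patch, indent=2))
```

The kubectl equivalent is kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute. Only apply the taint once you have confirmed that the node really is shut down, and remove it again after the workloads have moved.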

Improvements to CustomResourceDefinition validation rules

The Common Expression Language (CEL) can be used to validate custom resources. The primary goal is to allow the majority of the validation use cases that might once have needed you, as a CustomResourceDefinition (CRD) author, to design and implement a webhook. Instead, and as a beta feature, you can add validation expressions directly into the schema of a CRD.

CRDs need direct support for non-trivial validation. While admission webhooks do support CRD validation, they significantly complicate the development and operability of CRDs.

For more information, read validation rules in the CRD documentation.
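To give a feel for what such a rule looks like, here is a sketch of a schema fragment carrying one validation expression, written as a Python dictionary and printed as JSON. The resource fields (minReplicas, replicas, maxReplicas) are hypothetical; x-kubernetes-validations, rule and message are the schema keys the feature uses.

```python
import json

# Hypothetical schema for the spec of an example custom resource.
spec_schema = {
    "type": "object",
    "x-kubernetes-validations": [
        {
            # CEL expression; "self" refers to the object this schema describes.
            "rule": "self.minReplicas <= self.replicas && self.replicas <= self.maxReplicas",
            "message": "replicas must be between minReplicas and maxReplicas",
        }
    ],
    "properties": {
        "minReplicas": {"type": "integer"},
        "replicas": {"type": "integer"},
        "maxReplicas": {"type": "integer"},
    },
    "required": ["minReplicas", "replicas", "maxReplicas"],
}

print(json.dumps(spec_schema, indent=2))
```

A fragment like this would sit inside the openAPIV3Schema of a CRD version, with no webhook to deploy or operate.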

ValidatingAdmissionPolicies graduate to beta

Common Expression Language (CEL) for admission control provides customizable, in-process validation of requests to the Kubernetes API server, as an alternative to validating admission webhooks.

This builds on the capabilities of the CRD Validation Rules feature that graduated to beta in 1.25 but with a focus on the policy enforcement capabilities of validating admission control.

This will lower the infrastructure barrier to enforcing customizable policies as well as providing primitives that help the community establish and adhere to the best practices of both K8s and its extensions.

To use ValidatingAdmissionPolicies, you need to enable both the admissionregistration.k8s.io/v1beta1 API group and the ValidatingAdmissionPolicy feature gate in your cluster's control plane.
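As a sketch of what a policy might look like, the following Python builds a policy and its binding as dictionaries and prints them as JSON. The object names, the replica limit and the namespace label are hypothetical; the field names follow the v1beta1 API as best understood here, so check them against the documentation for your cluster version.

```python
import json

# Hypothetical policy: Deployments may not ask for more than 5 replicas.
policy = {
    "apiVersion": "admissionregistration.k8s.io/v1beta1",
    "kind": "ValidatingAdmissionPolicy",
    "metadata": {"name": "demo-policy.example.com"},
    "spec": {
        "failurePolicy": "Fail",
        "matchConstraints": {
            "resourceRules": [
                {
                    "apiGroups": ["apps"],
                    "apiVersions": ["v1"],
                    "operations": ["CREATE", "UPDATE"],
                    "resources": ["deployments"],
                }
            ]
        },
        # CEL expression evaluated in-process by the API server.
        "validations": [{"expression": "object.spec.replicas <= 5"}],
    },
}

# A policy has no effect until a binding puts it in scope.
binding = {
    "apiVersion": "admissionregistration.k8s.io/v1beta1",
    "kind": "ValidatingAdmissionPolicyBinding",
    "metadata": {"name": "demo-binding-test.example.com"},
    "spec": {
        "policyName": policy["metadata"]["name"],
        "validationActions": ["Deny"],
        "matchResources": {
            "namespaceSelector": {"matchLabels": {"environment": "test"}}
        },
    },
}

print(json.dumps([policy, binding], indent=2))
```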

Match conditions for admission webhooks

Kubernetes v1.27 introduced match conditions for admission webhooks, which let you narrow the scope of when Kubernetes makes a remote HTTP call at admission time. The matchConditions field for ValidatingWebhookConfiguration and MutatingWebhookConfiguration holds CEL expressions that must all evaluate to true for the admission request to be sent to the webhook.

In Kubernetes v1.28, that field moved to beta, and it's enabled by default.

To learn more, see matchConditions in the Kubernetes documentation.
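For illustration, a single webhook entry with one match condition could look like the fragment below, written as a Python dictionary. The webhook name is hypothetical, and the other fields a real webhook needs (clientConfig, rules, sideEffects, admissionReviewVersions) are left out to keep the sketch short.

```python
import json

# Fragment of one entry in the "webhooks" list of a
# ValidatingWebhookConfiguration; not a complete, applicable object.
webhook_fragment = {
    "name": "my-webhook.example.com",
    "matchConditions": [
        {
            "name": "exclude-leases",
            # Skip the remote call for Lease objects.
            "expression": (
                '!(request.resource.group == "coordination.k8s.io" '
                '&& request.resource.resource == "leases")'
            ),
        }
    ],
}

print(json.dumps(webhook_fragment, indent=2))
```

With this condition in place, requests for Leases never reach the webhook, which saves a remote HTTP call for a very chatty resource.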

Beta support for enabling swap space on Linux

This adds swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.

There are two distinct types of users for swap, who may overlap:

Node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbor issues.
Application developers, who have written applications that would benefit from using swap memory.
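A node administrator who wants to try this out would adjust the kubelet configuration. The sketch below shows the relevant settings as a Python dictionary printed as JSON. The field names and the LimitedSwap value reflect the v1.28 kubelet configuration as best understood here, so treat them as assumptions and verify them against the kubelet documentation for your version.

```python
import json

kubelet_config = {
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "kind": "KubeletConfiguration",
    # By default the kubelet refuses to start on a node with swap enabled.
    "failSwapOn": False,
    # Feature gate for swap support on nodes.
    "featureGates": {"NodeSwap": True},
    # Restrict how much swap Pods on this node may use.
    "memorySwap": {"swapBehavior": "LimitedSwap"},
}

print(json.dumps(kubelet_config, indent=2))
```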

Mixed version proxy (alpha)

When a cluster has multiple API servers at mixed versions (such as during an upgrade/downgrade or when runtime-config changes and a rollout happens), not every apiserver can serve every resource at every version.

For Kubernetes v1.28, you can enable the mixed version proxy within the API server's aggregation layer. The mixed version proxy finds requests that the local API server doesn't recognize but another API server inside the control plane is able to support. Having found a suitable peer, the aggregation layer proxies the request to a compatible API server; this is transparent from the client's perspective.

When an upgrade or downgrade is performed on a cluster, for some period of time the API servers within the control plane may be at differing versions; when that happens, different subsets of the API servers are able to serve different sets of built-in resources (different groups, versions, and resources are all possible). This new alpha mechanism lets you hide that skew from clients.
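The routing decision can be pictured with a small toy model. This is not how the API server is implemented: the function, the server names and the example resource are all hypothetical, and the real mechanism has its own way of discovering which peer serves what.

```python
def route(resource, local_serves, peers):
    """Decide which API server should handle a request.

    resource is a (group, version, resource) tuple; local_serves is the set
    of such tuples the local server can serve; peers maps a peer name to the
    set that peer can serve.
    """
    if resource in local_serves:
        return "local"
    for name, serves in sorted(peers.items()):
        if resource in serves:
            return name  # proxy to a compatible peer
    return None  # nobody can serve it: the client gets a "not found" error


# Mid-upgrade: the local server is still on the old version and does not
# yet know about a resource that the upgraded peer serves.
new_resource = ("example.k8s.io", "v2", "widgets")
old_resource = ("example.k8s.io", "v1", "widgets")
local = {old_resource}
peers = {"apiserver-b": {old_resource, new_resource}}

print(route(new_resource, local, peers))
print(route(old_resource, local, peers))
```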

Source code reorganization for control plane components

Kubernetes contributors have begun to reorganize the code for the kube-apiserver to build on a new staging repository that consumes k/apiserver but has a bigger, carefully chosen subset of the functionality of kube-apiserver such that it is reusable.

This is a gradual reorganization; eventually there will be a new git repository with generic functionality abstracted from Kubernetes' API server.

Support for CDI injection into containers (alpha)

The Container Device Interface (CDI) provides a standardized way of injecting complex devices into a container (that is, devices that logically require more than just a single /dev node to be injected for them to work). This new feature enables plugin developers to use the CDIDevices field added to the CRI in 1.27 to pass CDI devices directly to CDI-enabled runtimes (recent releases of containerd and CRI-O are CDI-enabled).

API awareness of sidecar containers (alpha)

Kubernetes 1.28 introduces an alpha restartPolicy field for init containers, and uses that to indicate when an init container is also a sidecar container. The kubelet will start init containers with restartPolicy: Always in the order they are defined, along with other init containers. Instead of waiting for that sidecar container to complete before starting the main container(s) for the Pod, the kubelet only waits for the sidecar init container to have started.

The condition for startup completion will be that the startup probe has succeeded (or no startup probe is defined) and the postStart handler has completed. This condition is represented by the Started field of the ContainerStatus type. See the section "Pod startup completed condition" for considerations on picking this signal.

For init containers, you can either omit the restartPolicy field, or set it to Always. Omitting the field means that you want a true init container that runs to completion before application startup.

Sidecar containers do not block Pod completion: if all regular containers are complete, sidecar containers in that Pod will be terminated.

For sidecar containers, the restart behavior is more complex than for init containers. In a Pod with restartPolicy set to Never, a sidecar container that fails during Pod startup will not be restarted and the whole Pod is treated as having failed. If the Pod's restartPolicy is Always or OnFailure, a sidecar that fails to start will be retried.

Once the sidecar container has started (process running, postStart was successful, and any configured startup probe is passing), and then there's a failure, that sidecar container will be restarted even when the Pod's overall restartPolicy is Never or OnFailure. Furthermore, sidecar containers will be restarted (on failure or on normal exit) even during Pod termination.
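The rules above can be condensed into a short sketch. The Pod below is a hypothetical example (container names and images are made up), and the two helper functions are a simplified model of the behavior described in this section, not kubelet code.

```python
# Hypothetical Pod with one sidecar and one regular init container.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "app-with-sidecar"},
    "spec": {
        "restartPolicy": "Never",
        "initContainers": [
            # restartPolicy: Always marks this init container as a sidecar.
            {"name": "log-shipper", "image": "example.com/log-shipper:1.0",
             "restartPolicy": "Always"},
            # No restartPolicy: a true init container that runs to completion.
            {"name": "migrate", "image": "example.com/migrate:1.0"},
        ],
        "containers": [{"name": "app", "image": "example.com/app:1.0"}],
    },
}


def is_sidecar(init_container):
    """An init container is a sidecar when its restartPolicy is Always."""
    return init_container.get("restartPolicy") == "Always"


def sidecar_restarts_after_failure(pod_restart_policy, started):
    """Model of whether a failed sidecar is restarted.

    Once started, a sidecar is restarted regardless of the Pod's policy.
    During startup, it is retried only if the Pod's policy allows restarts.
    """
    if started:
        return True
    return pod_restart_policy in ("Always", "OnFailure")


sidecars = [c["name"] for c in pod["spec"]["initContainers"] if is_sidecar(c)]
print(sidecars)
print(sidecar_restarts_after_failure("Never", started=False))
print(sidecar_restarts_after_failure("Never", started=True))
```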

To learn more, read API for sidecar containers.

Automatic, retroactive assignment of a default StorageClass graduates to stable

Kubernetes automatically sets a storageClassName for a PersistentVolumeClaim (PVC) if you don't provide a value. The control plane also sets a StorageClass for any existing PVC that doesn't have a storageClassName defined. Previous versions of Kubernetes also had this behavior; for Kubernetes v1.28 it is automatic and always active; the feature has graduated to stable (general availability).
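A toy model of the retroactive assignment, with a hypothetical helper and made-up PVC names, might look like this. One detail worth showing: an empty string is an explicit request for no class, so only PVCs where the field is unset are updated.

```python
def assign_default_class(pvcs, default_class):
    """Fill in storageClassName for PVCs that leave it unset.

    Returns the names of the PVCs that were changed. PVCs that set the
    field explicitly, including to the empty string, are left alone.
    """
    changed = []
    if default_class is None:
        return changed  # no default StorageClass exists in the cluster
    for pvc in pvcs:
        spec = pvc["spec"]
        if spec.get("storageClassName") is None:
            spec["storageClassName"] = default_class
            changed.append(pvc["metadata"]["name"])
    return changed


pvcs = [
    {"metadata": {"name": "unset"}, "spec": {}},
    {"metadata": {"name": "explicitly-none"}, "spec": {"storageClassName": ""}},
    {"metadata": {"name": "explicit"}, "spec": {"storageClassName": "fast"}},
]
print(assign_default_class(pvcs, "standard"))
```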

To learn more, read about StorageClass in the Kubernetes documentation.

Pod replacement policy for Jobs (alpha)

Kubernetes 1.28 adds a new field for the Job API that allows you to specify if you want the control plane to make new Pods as soon as the previous Pods begin termination (existing behavior), or only once the existing pods are fully terminated (new, optional behavior).

Many common machine learning frameworks, such as TensorFlow and JAX, require unique pods per index. With the older behavior, if a pod that belongs to an Indexed Job enters a terminating state (due to preemption, eviction or other external factors), a replacement pod is created but then immediately fails to start due to the clash with the old pod that has not yet shut down.

Having a replacement Pod appear before the previous one fully terminates can also cause problems in clusters with scarce resources or with tight budgets. These resources can be difficult to obtain so pods may only be able to find nodes once the existing pods have been terminated. If cluster autoscaler is enabled, early creation of replacement Pods might produce undesired scale-ups.
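As a sketch, an Indexed Job that opts in to the new behavior could be described like this. The Job name, image and sizes are hypothetical; podReplacementPolicy is the new field, and setting it to Failed asks for replacements only once the previous Pod has fully terminated.

```python
import json

job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "training-run"},
    "spec": {
        "completionMode": "Indexed",
        "completions": 4,
        "parallelism": 4,
        # "TerminatingOrFailed" is the existing behavior;
        # "Failed" waits until the old Pod is fully terminated.
        "podReplacementPolicy": "Failed",
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": "worker", "image": "example.com/trainer:1.0"}
                ],
            }
        },
    },
}

print(json.dumps(job, indent=2))
```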

To learn more, read Delayed creation of replacement pods in the Job documentation.

Job retry backoff limit, per index (alpha)

This extends the Job API to support indexed jobs where the backoff limit is per index, and the Job can continue execution despite some of its indexes failing.

Currently, the indexes of an indexed job share a single backoff limit. When the job reaches this shared backoff limit, the job controller marks the entire job as failed, and the resources are cleaned up, including indexes that have yet to run to completion.

As a result, the existing implementation did not cover the situation where the workload is truly embarrassingly parallel: each index is fully independent of other indexes.

For instance, if indexed jobs were used as the basis for a suite of long-running integration tests, then each test run would only be able to find a single test failure.
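The difference can be shown with a small simulation. It is a simplified model with hypothetical function names: each entry in the list is the number of times that index fails before it would eventually succeed, and a limit is treated as exceeded once the failure count goes above it.

```python
def shared_limit_outcome(failures, backoff_limit):
    """With one shared limit, the whole Job fails once total failures exceed it."""
    return "Failed" if sum(failures) > backoff_limit else "Complete"


def per_index_outcome(failures, backoff_limit_per_index):
    """With a per-index limit, each index succeeds or fails on its own."""
    succeeded = [i for i, n in enumerate(failures) if n <= backoff_limit_per_index]
    failed = [i for i, n in enumerate(failures) if n > backoff_limit_per_index]
    return succeeded, failed


# Four integration test suites, one per index; index 1 is badly broken.
failures = [0, 3, 0, 1]

print(shared_limit_outcome(failures, backoff_limit=2))
print(per_index_outcome(failures, backoff_limit_per_index=1))
```

In this model the shared limit fails the whole Job because of one flaky index, while the per-index limit lets the healthy indexes finish and reports exactly which index failed.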

For more information, read Handling Pod and container failures in the Kubernetes documentation.

CRI container and pod statistics without cAdvisor

This encompasses two related pieces of work (changes to the kubelet's /metrics/cadvisor endpoint and improvements to the replacement summary API).

There are two main APIs that consumers use to gather stats about running containers and pods: the summary API and /metrics/cadvisor. The kubelet is responsible for implementing the summary API, and cAdvisor is responsible for fulfilling /metrics/cadvisor.

This enhances CRI implementations to be able to fulfill all the stats needs of Kubernetes. At a high level, there are two pieces of this:

It enhances the CRI API with enough metrics to supplement the pod and container fields in the summary API directly from CRI.
It enhances the CRI implementations to broadcast the required metrics to fulfill the pod and container fields in the /metrics/cadvisor endpoint.

Feature graduations and deprecations in Kubernetes v1.28

Graduations to stable

This release includes a total of 12 enhancements promoted to Stable:

Deprecations and removals

Removals:

Removal of CSI Migration for GCE PD

Deprecations:

Release Notes

The complete details of the Kubernetes v1.28 release are available in our release notes.

Availability

Kubernetes v1.28 is available for download on GitHub. To get started with Kubernetes, you can run local Kubernetes clusters using minikube, kind, etc. You can also easily install v1.28 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to ensure we deliver a solid Kubernetes v1.28 release for our community.

Special thanks to our release lead, Grace Nguyen, for guiding us through a smooth and successful release cycle.

Ecosystem Updates

KubeCon + CloudNativeCon China 2023 will take place in Shanghai, China, from 26 – 28 September 2023! You can find more information about the conference and registration on the event site.
KubeCon + CloudNativeCon North America 2023 will take place in Chicago, Illinois, The United States of America, from 6 – 9 November 2023! You can find more information about the conference and registration on the event site.

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.28 release cycle, which ran for 14 weeks (May 15 to August 15), we saw contributions from 911 companies and 1440 individuals.

Upcoming Release Webinar

Join members of the Kubernetes v1.28 release team on Friday, September 14, 2023, at 10 a.m. PDT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests.

Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below:

Find out more about contributing to Kubernetes at the Kubernetes Contributors website.
Follow us on Twitter @Kubernetesio for the latest updates.
Join the community discussion on Discuss.
Join the community on Slack.
Post questions (or answer questions) on Server Fault.
Share your Kubernetes story.
Read more about what’s happening with Kubernetes on the blog.
Learn more about the Kubernetes Release Team.

https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/