Craig Ingram has graciously attempted over the years to keep track of the
status of the findings reported in the last audit in this issue:
kubernetes/kubernetes#81146.
This blog post dives deeper into that work, addresses any gaps
in tracking, and serves as a point-in-time summary of the state of the
findings reported in 2019.
This article should also help readers gain confidence, through transparent
communication, in the work done by the community to address these findings, and
surface any findings that still need help from community contributors.
Current State
The status of each issue / finding here is represented on a best-effort basis.
The authors do not claim to be 100% accurate on the status and welcome any
corrections or feedback, via a comment directly on the relevant issue, if the
current state is not reflected accurately.
Apart from fixes to the specific issues, the 2019 third-party security audit
also motivated security-focused enhancements in the next few releases of
Kubernetes. One such example is
Kubernetes. One such example is
Kubernetes Enhancement Proposal (KEP) 1933 Defend Against Logging Secrets via Static Analysis to prevent exposing
secrets to logs with Patrick Rhomberg driving the
implementation. As a result of this KEP,
go-flow-levee, a taint propagation
analysis tool configured to detect logging of secrets, is executed in a
script
as a Prow presubmit job. This KEP was introduced in v1.20.0 as an alpha
feature, then graduated to beta in v1.21.0, and graduated to stable in
v1.23.0. As stable, the analysis runs as a blocking presubmit test. This
KEP also helped resolve the following issues from the 2019 third party security audit:
Many of the 37 findings identified were fixed by work from
our community members over the last three years. However, we still have some work
left to do. Here's a breakdown of the remaining work, with rough estimates of the
time commitment, complexity, and benefit to the ecosystem of fixing
these pending issues.
Note: Anything requiring a KEP (Kubernetes Enhancement Proposal) is considered
high time commitment and high complexity. Benefit to Ecosystem is
roughly equivalent to the risk of leaving the finding unfixed, which is
determined by the severity level plus the likelihood of a successful vulnerability
exploit. These estimates and the values in the table below are the authors'
personal opinion. An individual's or end user's threat model may rate the
benefit of fixing a particular issue higher or lower.
| Title | Issue | Time Commitment | Complexity | Benefit to Ecosystem |
| --- | --- | --- | --- | --- |
| Kubernetes does not facilitate certificate revocation | | | | |
To get started on fixing any of these findings that need help, please
consider getting involved in Kubernetes SIG
Security
by joining our bi-weekly meetings or hanging out with us on our Slack
Channel.
Authors: Abdullah Gharaibeh (Google), Aldo Culquicondor (Google)
Whether on-premises or in the cloud, clusters face real constraints on resource usage, quota, and cost. Regardless of autoscaling capabilities, clusters have finite capacity. As a result, users want an easy way to fairly and
efficiently share resources.
In this article, we introduce Kueue,
an open source job queueing controller designed to manage batch jobs as a single unit.
Kueue leaves pod-level orchestration to existing stable components of Kubernetes.
Kueue natively supports the Kubernetes Job
API and offers hooks for integrating other custom-built APIs for batch jobs.
Why Kueue?
Job queueing is a key feature to run batch workloads at scale in both on-premises and cloud environments. The main goal
of job queueing is to manage access to a limited pool of resources shared by multiple tenants. Job queueing decides which
jobs should wait, which can start immediately, and what resources they can use.
Some of the most desired job queueing requirements include:
Quota and budgeting to control who can use what and up to what limit. This is not only needed in clusters with static resources like on-premises,
but it is also needed in cloud environments to control spend or usage of scarce resources.
Fair sharing of resources between tenants. To maximize the usage of available resources, any unused quota assigned to inactive tenants should be
allowed to be shared fairly between active tenants.
Flexible placement of jobs across different resource types based on availability. This is important in cloud environments which have heterogeneous
resources such as different architectures (GPU or CPU models) and different provisioning modes (spot vs on-demand).
Support for autoscaled environments where resources can be provisioned on demand.
Plain Kubernetes doesn't address the above requirements. In normal circumstances, once a Job is created, the job-controller instantly creates the
pods and kube-scheduler continuously attempts to assign the pods to nodes. At scale, this situation can work the control plane to death. There is
also currently no good way to control at the job level which jobs should get which resources first, and no way to express order or fair sharing. The
current ResourceQuota model is not a good fit for these needs because quotas are enforced on resource creation, and there is no queueing of requests. The
intent of ResourceQuotas is to provide a builtin reliability mechanism with policies needed by admins to protect clusters from failing over.
In the Kubernetes ecosystem, there are several solutions for job scheduling. However, we found that these alternatives have one or more of the following problems:
They replace existing stable components of Kubernetes, like kube-scheduler or the job-controller. This is problematic not only from an operational point of view, but
also the duplication in the job APIs causes fragmentation of the ecosystem and reduces portability.
They don't integrate with autoscaling, or
They lack support for resource flexibility.
How Kueue works
With Kueue we decided to take a different approach to job queueing on Kubernetes that is anchored around the following aspects:
Not duplicating existing functionalities already offered by established Kubernetes components for pod scheduling, autoscaling and job
lifecycle management.
Adding key features that are missing to existing components. For example, we invested in the Job API to cover more use cases like
IndexedJob and fixed long standing issues related to pod
tracking. While this path takes longer to
land features, we believe it is the more sustainable long term solution.
Ensuring compatibility with cloud environments where compute resources are elastic and heterogeneous.
For this approach to be feasible, Kueue needs knobs to influence the behavior of those established components so it can effectively manage
when and where to start a job. We added those knobs to the Job API in the form of two features:
Suspend field, which allows Kueue to signal to the job-controller
when to start or stop a Job.
Mutable scheduling directives, which allows Kueue to
update a Job's .spec.template.spec.nodeSelector before starting the Job. This way, Kueue can control Pod placement while still
delegating to kube-scheduler the actual pod-to-node scheduling.
Note that any custom job API can be managed by Kueue if that API offers the above two capabilities.
Resource model
Kueue defines new APIs to address the requirements mentioned at the beginning of this post. The three main APIs are:
ResourceFlavor: a cluster-scoped API to define a resource flavor available for consumption, such as a GPU model. At its core, a ResourceFlavor is
a set of labels that mirrors the labels on the nodes that offer those resources.
ClusterQueue: a cluster-scoped API to define resource pools by setting quotas for one or more ResourceFlavor.
LocalQueue: a namespaced API for grouping and managing single tenant jobs. In its simplest form, a LocalQueue is a pointer to the ClusterQueue
that the tenant (modeled as a namespace) can use to start their jobs.
For more details, take a look at the API concepts documentation. While the three APIs may look overwhelming,
most of Kueue’s operations are centered around ClusterQueue; the ResourceFlavor and LocalQueue APIs are mainly organizational wrappers.
Example use case
Imagine the following setup for running batch workloads on a Kubernetes cluster on the cloud:
You have cluster-autoscaler installed in the cluster to automatically
adjust the size of your cluster.
There are two types of autoscaled node groups that differ on their provisioning policies: spot and on-demand. The nodes of each group are
differentiated by the label instance-type=spot or instance-type=ondemand.
Moreover, since not all Jobs can tolerate running on spot nodes, the nodes are tainted with spot=true:NoSchedule.
To strike a balance between cost and resource availability, imagine you want Jobs to use up to 1000 cores of on-demand nodes, then use up to
2000 cores of spot nodes.
As an admin for the batch system, you define two ResourceFlavors that represent the two types of nodes:
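The manifests for this setup might look like the following. This is a minimal sketch based on the Kueue API at the time of writing (v1alpha1); names such as research-pool are placeholders, and the exact schema may differ in newer Kueue releases, so check the Kueue documentation:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: ResourceFlavor
metadata:
  name: ondemand
labels:
  instance-type: ondemand
---
apiVersion: kueue.x-k8s.io/v1alpha1
kind: ResourceFlavor
metadata:
  name: spot
labels:
  instance-type: spot
taints:
- key: spot
  value: "true"
  effect: NoSchedule
---
# A ClusterQueue granting up to 1000 on-demand cores and 2000 spot cores
apiVersion: kueue.x-k8s.io/v1alpha1
kind: ClusterQueue
metadata:
  name: research-pool
spec:
  namespaceSelector: {}
  resources:
  - name: "cpu"
    flavors:
    - name: ondemand
      quota:
        min: 1000
    - name: spot
      quota:
        min: 2000
```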
Note that the order of flavors in the ClusterQueue resources matters: Kueue will attempt to fit jobs in the available quotas according to
the order unless the job has an explicit affinity to specific flavors.
For each namespace, you define a LocalQueue that points to the ClusterQueue above:
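A LocalQueue for a tenant namespace could be sketched as follows (the namespace and names are illustrative, and the API version should match your installed Kueue release):

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: LocalQueue
metadata:
  namespace: team-a          # the tenant's namespace
  name: main-queue
spec:
  clusterQueue: research-pool   # points at the cluster-scoped queue defined by the admin
```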
Admins create the above setup once. Batch users are able to find the queues they are allowed to
submit to by listing the LocalQueues in their namespace(s). The command is similar to the following: kubectl get -n my-namespace localqueues
To submit work, create a Job and set the kueue.x-k8s.io/queue-name annotation as follows:
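A sketch of such a Job follows; the image, command, and resource requests are illustrative, and the queue name must match a LocalQueue in the Job's namespace:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  annotations:
    kueue.x-k8s.io/queue-name: main-queue   # the LocalQueue to submit to
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      tolerations:             # this Job tolerates running on spot nodes
      - key: spot
        operator: Exists
        effect: NoSchedule
      containers:
      - name: main
        image: busybox
        command: ["sleep", "30"]
        resources:
          requests:
            cpu: 1
      restartPolicy: Never
```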
Kueue intervenes to suspend the Job as soon as it is created. Once the Job is at the head of the ClusterQueue, Kueue evaluates if it can start
by checking if the resources requested by the job fit the available quota.
In the above example, the Job tolerates spot resources. If there are previously admitted Jobs consuming all existing on-demand quota but
not all of spot’s, Kueue admits the Job using the spot quota. Kueue does this by issuing a single update to the Job object that:
Changes the .spec.suspend flag to false
Adds the term instance-type: spot to the job's .spec.template.spec.nodeSelector so that when the pods are created by the job controller, those pods can only schedule
onto spot nodes.
Finally, if there are available empty nodes with matching node selector terms, then kube-scheduler will directly schedule the pods. If not, then
kube-scheduler will initially mark the pods as unschedulable, which will trigger the cluster-autoscaler to provision new nodes.
Future work and getting involved
The example above offers a glimpse of some of Kueue's features including support for quota, resource flexibility, and integration with cluster
autoscaler. Kueue also supports fair-sharing, job priorities, and different queueing strategies. Take a look at the
Kueue documentation to learn more about those features and how to use Kueue.
We have a number of features that we plan to add to Kueue, such as hierarchical quota, budgets, and support for dynamically sized jobs. In
the more immediate future, we are focused on adding support for job preemption.
The latest Kueue release is available on Github;
try it out if you run batch workloads on Kubernetes (requires v1.22 or newer).
We are in the early stages of this project and we are seeking feedback of all levels, major or minor, so please don’t hesitate to reach out. We’re
also open to additional contributors, whether it is to fix or report bugs, or help add new features or write documentation. You can get in touch with
us via our repo, mailing list or on
Slack.
Last but not least, thanks to all our contributors who made this project possible!
Authors: Rodrigo Campos (Microsoft), Giuseppe Scrivano (Red Hat)
Kubernetes v1.25 introduces the support for user namespaces.
This is a major improvement for running secure workloads in
Kubernetes. Each pod will have access only to a limited subset of the
available UIDs and GIDs on the system, thus adding a new security
layer to protect from other pods running on the same system.
How does it work?
A process running on Linux can use up to 4294967296 different UIDs and
GIDs.
User namespaces are a Linux feature that allows mapping a set of users
in the container to different users on the host, thus restricting which
IDs a process can effectively use.
Furthermore, the capabilities granted in a new user namespace do not
apply in the host's initial namespaces.
Why is it important?
There are two main reasons why user namespaces are important:
improved security, since they restrict the IDs a pod can use, so each
pod can run in its own separate environment with unique IDs.
the ability to run workloads as root in a safer manner.
In a user namespace we can map the root user inside the pod to a
non-zero ID outside the container: the container believes it is running as
root, while from the host's point of view it runs as a regular
unprivileged ID.
The process can keep capabilities that are usually restricted to
privileged pods and do so in a safe way, since the capabilities granted
in a new user namespace do not apply in the host's initial namespaces.
How do I enable user namespaces?
At the moment, user namespaces support is opt-in, so you must enable
it for each pod by setting hostUsers to false in the pod's spec:
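For example, a minimal Pod opting in to user namespaces might look like this (the name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo
spec:
  hostUsers: false        # run this pod's containers in a new user namespace
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
```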
Immutable fields can be found in a few places in the built-in Kubernetes types.
For example, you can't change the .metadata.name of an object. Specific objects
have fields where changes to existing objects are constrained; for example, the
.spec.selector of a Deployment.
Aside from simple immutability, there are other common design patterns such as
lists which are append-only, or a map with mutable values and immutable keys.
Until recently the best way to restrict field mutability for CustomResourceDefinitions
has been to create a validating
admission webhook:
this means a lot of complexity for the common case of making a field immutable.
Beta since Kubernetes 1.25, CEL Validation Rules allow CRD authors to express
validation constraints on their fields using a rich expression language,
CEL. This article explores how you can
use validation rules to implement a few common immutability patterns directly in
the manifest for a CRD.
Basics of validation rules
The new support for CEL validation rules in Kubernetes allows CRD authors to add
complicated admission logic for their resources without writing any code!
For example, A CEL rule to constrain a field maximumSize to be greater than a
minimumSize for a CRD might look like the following:
rule: |
  self.maximumSize > self.minimumSize
message: 'Maximum size must be greater than minimum size.'
The rule field contains an expression written in CEL. self is a special keyword
in CEL which refers to the object whose type contains the rule.
The message field is an error message which will be sent to Kubernetes clients
whenever this particular rule is not satisfied.
For more details about the capabilities and limitations of Validation Rules using
CEL, please refer to
validation rules.
The CEL specification is also a good
reference for information specifically related to the language.
Immutability patterns with CEL validation rules
This section implements several common use cases for immutability in Kubernetes
CustomResourceDefinitions, using validation rules expressed as
kubebuilder marker comments.
The resultant OpenAPI generated by the kubebuilder marker comments will also be
included, so that if you are writing your CRD manifests by hand you can still
follow along.
Project setup
To use CEL rules with kubebuilder comments, you first need to set up a Golang
project structure with the CRD defined in Go.
You may skip this step if you are not using kubebuilder or are only interested
in the resultant OpenAPI extensions.
Begin with a folder structure of a Go module set up like the following. If
you have your own project already set up feel free to adapt this tutorial to your liking:
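Assuming the module is named cel-immutability-tutorial and the API group is stable.example.com/v1 (as used throughout this tutorial), the starting layout might look like:

```
cel-immutability-tutorial
├── generate.go
├── go.mod
├── pkg
│   └── apis
│       └── stable.example.com
│           └── v1
│               └── types.go
└── tools.go
```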
types.go contains all type definitions in stable.example.com/v1
package v1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// An empty CRD as an example of defining a type using controller tools
// +kubebuilder:storageversion
// +kubebuilder:subresource:status
type TestCRD struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec TestCRDSpec `json:"spec,omitempty"`
Status TestCRDStatus `json:"status,omitempty"`
}
type TestCRDStatus struct {}
type TestCRDSpec struct {
// You will fill this in as you go along
}
tools.go contains a dependency on controller-gen which will be used to generate the CRD definition:
//go:build tools
package celimmutabilitytutorial
// Force a direct dependency on controller-gen so that it may be executed with go run
import (
_ "sigs.k8s.io/controller-tools/cmd/controller-gen"
)
Finally, generate.go contains a go:generate directive that invokes
controller-gen. controller-gen parses our types.go and generates
CRD YAML files into a crds folder:
package celimmutabilitytutorial
//go:generate go run sigs.k8s.io/controller-tools/cmd/controller-gen crd paths=./pkg/apis/... output:dir=./crds
You may now want to add dependencies for our definitions and test the code generation:
cd cel-immutability-tutorial
go mod init <your-org>/<your-module-name>
go mod tidy
go generate ./...
After running these commands you now have completed the basic project structure.
Your folder tree should look like the following:
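Assuming the same module layout as above, the tree after generation might look like:

```
cel-immutability-tutorial
├── crds
│   └── stable.example.com_testcrds.yaml
├── generate.go
├── go.mod
├── go.sum
├── pkg
│   └── apis
│       └── stable.example.com
│           └── v1
│               └── types.go
└── tools.go
```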
The manifest for the example CRD is now available in crds/stable.example.com_testcrds.yaml.
Immutability after first modification
A common immutability design pattern is to make the field immutable once it has
first been set. This example will produce a validation error if the field
changes after being first initialized.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type ImmutableSinceFirstWrite struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
// +kubebuilder:validation:MaxLength=512
Value string `json:"value"`
}
The +kubebuilder directives in the comments inform controller-gen how to
annotate the generated OpenAPI. The XValidation rule causes the rule to appear
among the x-kubernetes-validations OpenAPI extension. Kubernetes then
respects the OpenAPI spec to enforce our constraints.
To enforce a field's immutability after its first write, you need to apply the following constraints:
The field must be allowed to be initially unset: +kubebuilder:validation:Optional
Once set, the field must not be allowed to be removed: !has(oldSelf.value) || has(self.value) (type-scoped rule)
Once set, the field must not be allowed to change value: self == oldSelf (field-scoped rule)
Also note the additional directive +kubebuilder:validation:MaxLength. CEL
requires that all strings have a maximum length attached so that it can estimate the
computation cost of the rule. Rules that are too expensive are rejected.
For more information on CEL cost budgeting, check out the other tutorial.
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincefirstwrites.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincefirstwrites.stable.example.com created
Creating an initial empty object with no value is permitted, since value is optional. Once value has been set, however, an attempt to remove it is rejected by the type-scoped rule:
The ImmutableSinceFirstWrite "test1" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
Note that in the generated schema there are two separate rule locations.
One is attached directly to the value property.
The other rule is associated with the CRD type itself.
openAPIV3Schema:
properties:
value:
maxLength: 512
type: string
x-kubernetes-validations:
- message: Value is immutable
rule: self == oldSelf
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.value) || has(self.value)'
Immutability upon object creation
A field which is immutable upon creation time is implemented similarly to the
earlier example. The difference is that that field is marked required, and the
type-scoped rule is no longer necessary.
type ImmutableSinceCreation struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Required
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
// +kubebuilder:validation:MaxLength=512
Value string `json:"value"`
}
This field will be required when the object is created, and after that point it will
not be allowed to be modified; our CEL validation rule self == oldSelf enforces this.
Usage example
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincecreations.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincecreations.stable.example.com created
Applying an object without the required field should fail:
The ImmutableSinceCreation "test1" is invalid:
* value: Required value
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation
Now that the field has been added, the operation is permitted:
immutablesincecreation.stable.example.com/test1 created
If you attempt to change the value, the operation is blocked due to the
validation rules in the CRD. Note that the error message is as it was defined
in the validation rule.
The ImmutableSinceCreation "test1" is invalid: value: Invalid value: "string": Value is immutable
Generated schema
openAPIV3Schema:
properties:
value:
maxLength: 512
type: string
x-kubernetes-validations:
- message: Value is immutable
rule: self == oldSelf
required:
- value
type: object
Append-only list of containers
In the case of ephemeral containers on Pods, Kubernetes enforces that the
elements in the list are immutable, and can’t be removed. The following example
shows how you could use CEL to achieve the same behavior.
// Note: this type requires importing v1 "k8s.io/api/core/v1" for the EphemeralContainer type.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type AppendOnlyList struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:MaxItems=100
// +kubebuilder:validation:XValidation:rule="oldSelf.all(x, x in self)",message="Values may only be added"
Values []v1.EphemeralContainer `json:"value"`
}
Once set, field must not be deleted: !has(oldSelf.value) || has(self.value) (type-scoped)
Once a value is added it is not removed: oldSelf.all(x, x in self) (field-scoped)
Value may be initially unset: +kubebuilder:validation:Optional
Note that for cost-budgeting purposes, MaxItems is also required to be specified.
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_appendonlylists.yaml
customresourcedefinition.apiextensions.k8s.io/appendonlylists.stable.example.com created
Creating an initial list with one element inside should succeed without problem. Once the list has been set, however, removing the entire value field is rejected by the type-scoped rule:
The AppendOnlyList "testlist" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
openAPIV3Schema:
properties:
value:
items: ...
maxItems: 100
type: array
x-kubernetes-validations:
- message: Values may only be added
rule: oldSelf.all(x, x in self)
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.value) || has(self.value)'
Map with append-only keys, immutable values
// A map which does not allow keys to be removed or their values changed once set. New keys may be added, however.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.values) || has(self.values)", message="Value is required once set"
type MapAppendOnlyKeys struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:MaxProperties=10
// +kubebuilder:validation:XValidation:rule="oldSelf.all(key, key in self && self[key] == oldSelf[key])",message="Keys may not be removed and their values must stay the same"
Values map[string]string `json:"values,omitempty"`
}
Once set, field must not be deleted: !has(oldSelf.values) || has(self.values) (type-scoped)
Once a key is added it is not removed nor is its value modified: oldSelf.all(key, key in self && self[key] == oldSelf[key]) (field-scoped)
Value may be initially unset: +kubebuilder:validation:Optional
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_mapappendonlykeys.yaml
customresourcedefinition.apiextensions.k8s.io/mapappendonlykeys.stable.example.com created
Creating an initial object with one key within values is permitted. Changing the value of an existing key (or removing a key), however, is rejected by the field-scoped rule:
The MapAppendOnlyKeys "testmap" is invalid: values: Invalid value: "object": Keys may not be removed and their values must stay the same
If the entire field is removed, the other validation rule is triggered and the
operation is prevented. Note that the error message for the validation rule is
shown to the user.
The MapAppendOnlyKeys "testmap" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
openAPIV3Schema:
description: A map which does not allow keys to be removed or their values
changed once set. New keys may be added, however.
properties:
values:
additionalProperties:
type: string
maxProperties: 10
type: object
x-kubernetes-validations:
- message: Keys may not be removed and their values must stay the same
rule: oldSelf.all(key, key in self && self[key] == oldSelf[key])
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.values) || has(self.values)'
Going further
The above examples showed how CEL rules can be added to kubebuilder types.
The same rules can be added directly to OpenAPI if writing a manifest for a CRD by hand.
For native types, the same behavior can be achieved using kube-openapi’s marker
+validations.
Usage of CEL within Kubernetes validation rules is far more powerful than
what has been shown in this article. For more information please check out
validation rules
in the Kubernetes documentation and the CRD Validation Rules Beta blog post.
The Kubernetes in-tree storage plugin to Container Storage Interface (CSI) migration infrastructure has been beta since v1.17; CSI migration was introduced as alpha in Kubernetes v1.14.
Since then, SIG Storage and other Kubernetes special interest groups have been working to ensure feature stability and compatibility in preparation for the CSI Migration feature to go GA.
SIG Storage is excited to announce that the core CSI Migration feature is generally available in the Kubernetes v1.25 release!
SIG Storage wrote a blog post in v1.23 about the CSI migration status of each storage driver. It has been a while, and this article gives the latest status update on each storage driver's CSI migration in Kubernetes v1.25.
Quick recap: What is CSI Migration, and why migrate?
The Container Storage Interface (CSI) was designed to help Kubernetes replace its existing, in-tree storage driver mechanisms - especially vendor specific plugins.
Kubernetes support for the Container Storage Interface has been
generally available since Kubernetes v1.13.
Support for using CSI drivers was introduced to make it easier to add and maintain new integrations between Kubernetes and storage backend technologies. Using CSI drivers allows for better maintainability (driver authors can define their own release cycle and support lifecycle) and reduces the opportunity for vulnerabilities (with less in-tree code, the risk of a mistake is reduced, and cluster operators can select only the storage drivers that their cluster requires).
As more CSI Drivers were created and became production ready, SIG Storage wanted all Kubernetes users to benefit from the CSI model. However, we could not break API compatibility with the existing storage API types due to k8s architecture conventions. The solution we came up with was CSI migration: a feature that translates in-tree APIs to equivalent CSI APIs and delegates operations to a replacement CSI driver.
The CSI migration effort enables the replacement of existing in-tree storage plugins such as kubernetes.io/gce-pd or kubernetes.io/aws-ebs with a corresponding CSI driver from the storage backend.
If CSI Migration is working properly, Kubernetes end users shouldn’t notice a difference. Existing StorageClass, PersistentVolume and PersistentVolumeClaim objects should continue to work.
When a Kubernetes cluster administrator updates a cluster to enable CSI migration, existing workloads that utilize PVCs which are backed by in-tree storage plugins will continue to function as they always have.
However, behind the scenes, Kubernetes hands control of all storage management operations (previously targeting in-tree drivers) to CSI drivers.
For example, suppose you are a kubernetes.io/gce-pd user; after CSI migration, you can still use kubernetes.io/gce-pd to provision new volumes, mount existing GCE-PD volumes or delete existing volumes. All existing APIs and Interface will still function correctly. However, the underlying function calls are all going through the GCE PD CSI driver instead of the in-tree Kubernetes function.
This enables a smooth transition for end users. Additionally as storage plugin developers, we can reduce the burden of maintaining the in-tree storage plugins and eventually remove them from the core Kubernetes binary.
What is the timeline / status?
The current and targeted releases for each individual driver are shown in the table below:
| Driver | Alpha | Beta (in-tree deprecated) | Beta (on-by-default) | GA | Target "in-tree plugin" removal |
| --- | --- | --- | --- | --- | --- |
| AWS EBS | 1.14 | 1.17 | 1.23 | 1.25 | 1.27 (Target) |
| Azure Disk | 1.15 | 1.19 | 1.23 | 1.24 | 1.26 (Target) |
| Azure File | 1.15 | 1.21 | 1.24 | 1.26 (Target) | 1.28 (Target) |
| Ceph FS | 1.26 (Target) | | | | |
| Ceph RBD | 1.23 | 1.26 (Target) | 1.27 (Target) | 1.28 (Target) | 1.30 (Target) |
| GCE PD | 1.14 | 1.17 | 1.23 | 1.25 | 1.27 (Target) |
| OpenStack Cinder | 1.14 | 1.18 | 1.21 | 1.24 | 1.26 (Target) |
| Portworx | 1.23 | 1.25 | 1.26 (Target) | 1.27 (Target) | 1.29 (Target) |
| vSphere | 1.18 | 1.19 | 1.25 | 1.26 (Target) | 1.28 (Target) |
The following storage drivers will not have CSI migration support.
The scaleio, flocker, quobyte and storageos drivers were removed; the others are deprecated and will be removed from core Kubernetes in the coming releases.
| Driver | Deprecated | Code Removal |
| --- | --- | --- |
| Flocker | 1.22 | 1.25 |
| GlusterFS | 1.25 | 1.26 (Target) |
| Quobyte | 1.22 | 1.25 |
| ScaleIO | 1.16 | 1.22 |
| StorageOS | 1.22 | 1.25 |
What does it mean for the core CSI Migration feature to go GA?
The core CSI Migration feature going GA means that the general framework, core library, and API for CSI migration are
stable in Kubernetes v1.25 and will be part of future Kubernetes releases as well.
If you are a Kubernetes distribution maintainer, this means you can no longer disable the CSIMigration feature gate: it has been locked to enabled.
If you are a Kubernetes storage driver developer, this means you can expect no backwards-incompatible changes in the CSI migration library.
If you are a Kubernetes maintainer, expect no changes to your day-to-day development flows.
If you are a Kubernetes user, expect nothing to change from your day-to-day usage flows. If you encounter any storage related issues, contact the people who operate your cluster (if that's you, contact the provider of your Kubernetes distribution, or get help from the community).
What does it mean for the storage driver CSI migration to go GA?
Storage driver CSI migration going GA means that the specific storage driver supports CSI migration, with feature parity between the in-tree plugin and the CSI driver.
If you are a Kubernetes distribution maintainer, make sure you install the corresponding
CSI driver on the distribution. Also make sure you are not disabling the driver-specific CSIMigration{provider} feature gates, as they are locked.
If you are a Kubernetes storage driver maintainer, make sure the CSI driver provides feature parity with the in-tree plugin if it supports CSI migration.
If you are a Kubernetes maintainer/developer, expect nothing to change in your day-to-day development flows.
If you are a Kubernetes user, the CSI migration feature should be completely transparent
to you; the only requirement is to install the corresponding CSI driver.
What's next?
We expect removal of the in-tree cloud provider storage plugin code to begin as part of the v1.26 and v1.27 releases of Kubernetes. More and more drivers that support CSI migration will go GA in the upcoming releases.
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. We offer a huge thank you to the contributors who stepped up these last quarters to help move the project forward:
Xing Yang (xing-yang)
Hemant Kumar (gnufied)
Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CSI migration feature:
Andy Zhang (andyzhangz)
Divyen Patel (divyenpatel)
Deep Debroy (ddebroy)
Humble Devassy Chirammal (humblec)
Ismail Alidzhikov (ialidzhikov)
Jordan Liggitt (liggitt)
Matthew Cary (mattcary)
Matthew Wong (wongma7)
Neha Arora (nearora-msft)
Oksana Naumov (trierra)
Saad Ali (saad-ali)
Michelle Au (msau42)
Those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system should join the Kubernetes Storage Special Interest Group (SIG). We're rapidly growing and always welcome new contributors.
Validation rules make it possible to declare how custom resources are validated using the Common Expression Language (CEL). For example:
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
...
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          x-kubernetes-validations:
            - rule: "self.minReplicas <= self.replicas && self.replicas <= self.maxReplicas"
              message: "replicas should be in the range minReplicas..maxReplicas."
          properties:
            replicas:
              type: integer
            ...
```
Validation rules support a wide range of use cases. To get a sense of some of the capabilities, let's look at a few examples:
| Validation Rule | Purpose |
|-----------------|---------|
| `self.minReplicas <= self.replicas` | Validate an integer field is less than or equal to another integer field |
| `'Available' in self.stateCounts` | Validate an entry with the 'Available' key exists in a map |
| `self.set1.all(e, !(e in self.set2))` | Validate that the elements of two sets are disjoint |
| `self == oldSelf` | Validate that a required field is immutable once it is set |
| `self.created + self.ttl < self.expired` | Validate that 'expired' date is after a 'create' date plus a 'ttl' duration |
Validation rules are expressive and flexible. See the Validation Rules documentation to learn more about what validation rules are capable of.
Why CEL?
CEL was chosen as the language for validation rules for a couple of reasons:
CEL expressions can easily be inlined into CRD schemas. They are sufficiently expressive to replace the vast majority of CRD validation checks currently implemented in admission webhooks. This results in CRDs that are self-contained and are easier to understand.
CEL expressions are compiled and type checked against a CRD's schema "ahead-of-time" (when CRDs are created and updated) allowing them to be evaluated efficiently and safely "runtime" (when custom resources are validated). Even regex string literals in CEL are validated and pre-compiled when CRDs are created or updated.
Why not use validation webhooks?
Benefits of using validation rules when compared with validation webhooks:
CRD authors benefit from a simpler workflow since validation rules eliminate the need to develop and maintain a webhook.
Cluster administrators benefit by no longer having to install, upgrade and operate webhooks for the purposes of CRD validation.
Cluster operability improves because CRD validation no longer requires a remote call to a webhook endpoint, eliminating a potential point of failure in the request-serving-path of the Kubernetes API server. This allows clusters to retain high availability while scaling to larger amounts of installed CRD extensions, since expected control plane availability would otherwise decrease with each additional webhook installed.
Getting started with validation rules
Writing validation rules in OpenAPIv3 schemas
You can define validation rules for any level of a CRD's OpenAPIv3 schema. Validation rules are automatically scoped to their location in the schema where they are declared.
Good practices for CRD validation rules:
Scope validation rules as close as possible to the field(s) they validate.
Use multiple rules when validating independent constraints.
Do not use validation rules for validations that OpenAPIv3 can already express:
Use OpenAPIv3 value validations (maxLength, maxItems, maxProperties, required, enum, minimum, maximum, ..) and string formats where available.
Use x-kubernetes-int-or-string, x-kubernetes-embedded-type and x-kubernetes-list-type=(set|map) where appropriate.
Examples of good practice:

Validate an integer is between 0 and 100: use OpenAPIv3 value validations.

```yaml
type: integer
minimum: 0
maximum: 100
```

Constrain the maximum size of maps (objects with additionalProperties), arrays and strings: use OpenAPIv3 value validations. This is recommended for all maps, arrays and strings, and is essential for rule cost estimation (explained below).

```yaml
type: array
maxItems: 100
```

Require a date-time to be more recent than a particular timestamp: use OpenAPIv3 string formats to declare that the field is a date-time, and use validation rules to compare it to a particular timestamp.

Require two sets to be disjoint: use x-kubernetes-list-type to validate that the arrays are sets, and use validation rules to validate that the sets are disjoint.

```yaml
type: object
properties:
  set1:
    type: array
    x-kubernetes-list-type: set
  set2:
    ...
x-kubernetes-validations:
  - rule: "self.set1.all(e, !(e in self.set2))"
```
CRD transition rules
Transition Rules make it possible to compare the new state against the old state of a resource in validation rules. You use transition rules to make sure that the cluster's API server does not accept invalid state transitions. A transition rule is a validation rule that references 'oldSelf'. The API server only evaluates transition rules when both an old value and new value exist.
Transition rule examples:
| Transition Rule | Purpose |
|-----------------|---------|
| `self == oldSelf` | For a required field, make that field immutable once it is set. For an optional field, only allow transitioning from unset to set, or from set to unset. |
| (on parent of field) `has(self.field) == has(oldSelf.field)`; (on field) `self == oldSelf` | Make a field immutable: validate that a field, even if optional, never changes after the resource is created (for a required field, the previous rule is simpler). |
| `self.all(x, x in oldSelf)` | Only allow adding items to a field that represents a set (prevent removals). |
| `self >= oldSelf` | Validate that a number is monotonically increasing. |
Using the Functions Libraries
Validation rules have access to a couple of different function libraries. Some examples:

| Example | Purpose |
|---------|---------|
| `isURL(self) && url(self).getHostname() in ['a.example.com', 'b.example.com']` | Validate that a URL has an allowed hostname. |
| `self.map(x, x.weight).sum() == 1` | Validate that the weights of a list of objects sum to 1. |
| `int(self.find('^[0-9]*')) < 100` | Validate that a string starts with a number less than 100. |
| `self.isSorted()` | Validate that a list is sorted. |
Resource use and limits
To prevent CEL evaluation from consuming excessive compute resources, validation rules impose some limits. These limits are based on CEL cost units, a platform and machine independent measure of execution cost. As a result, the limits are the same regardless of where they are enforced.
Estimated cost limit
CEL is, by design, non-Turing-complete in such a way that the halting problem isn’t a concern. CEL takes advantage of this design choice to include an "estimated cost" subsystem that can statically compute the worst case run time cost of any CEL expression. Validation rules are integrated with the estimated cost system and disallow CEL expressions from being included in CRDs if they have a sufficiently poor (high) estimated cost. The estimated cost limit is set quite high and typically requires an O(n^2) or worse operation, across something of unbounded size, to be exceeded. Fortunately the fix is usually quite simple: because the cost system is aware of size limits declared in the CRD's schema, CRD authors can add size limits to the CRD's schema (maxItems for arrays, maxProperties for maps, maxLength for strings) to reduce the estimated cost.
Good practice:
Set maxItems, maxProperties and maxLength on all array, map (object with additionalProperties) and string types in CRD schemas! This results in lower and more accurate estimated costs and generally makes a CRD safer to use.
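For example, a schema that follows this practice bounds every unbounded type (field names are illustrative):

```yaml
type: object
properties:
  tags:
    type: array
    maxItems: 100            # bounds the array for cost estimation
    items:
      type: string
      maxLength: 63          # bounds each string element
  annotations:
    type: object
    maxProperties: 50        # bounds the map
    additionalProperties:
      type: string
      maxLength: 256
```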
Runtime cost limits for CRD validation rules
In addition to the estimated cost limit, CEL keeps track of actual cost while evaluating a CEL expression and will halt execution of the expression if a limit is exceeded.
With the estimated cost limit already in place, the runtime cost limit is rarely encountered. But it is possible. For example, it might be encountered for a large resource composed entirely of a single large list and a validation rule that is either evaluated on each element in the list, or traverses the entire list.
CRD authors can ensure the runtime cost limit will not be exceeded in much the same way the estimated cost limit is avoided: by setting maxItems, maxProperties and maxLength on array, map and string types.
Future work
We look forward to working with the community on the adoption of CRD Validation Rules, and hope to see this feature promoted to general availability in an upcoming Kubernetes release!
There is a growing community of Kubernetes contributors thinking about how to make it possible to write extensible admission controllers using CEL as a substitute for admission webhooks for policy enforcement use cases. Anyone interested should reach out to us on the usual SIG API Machinery channels or via Slack at #sig-api-machinery-cel-dev.
Acknowledgements
Special thanks to Cici Huang, Ben Luddy, Jordan Liggitt, David Eads, Daniel Smith, Dr. Stefan Schimanski, Leila Jalali and everyone who contributed to Validation Rules!
Author: Humble Chirammal (Red Hat), Louis Koo (deeproute.ai)
Kubernetes v1.25, released earlier this month, introduced a new feature
that lets your cluster expand storage volumes, even when access to those
volumes requires a secret (for example: a credential for accessing a SAN fabric)
to perform the node expand operation. This new behavior is in alpha and you
must enable a feature gate (CSINodeExpandSecret) to make use of it.
You must also be using CSI
storage; this change isn't relevant to storage drivers that are built in to Kubernetes.
To turn on this new, alpha feature, you enable the CSINodeExpandSecret feature
gate for the kube-apiserver and kubelet. Once it is enabled, CSI drivers receive
a secretRef configuration as part of the NodeExpansion request and can use it to
perform the node-side expansion operation against the underlying storage system.
What is this all about?
Before Kubernetes v1.24, you were able to define a cluster-level StorageClass
that made use of StorageClass Secrets,
but you didn't have any mechanism to specify the credentials that would be used for
operations that take place when the storage is mounted onto a node and when
the volume has to be expanded on the node side.
Kubernetes CSI already implemented a similar mechanism for specific kinds of
volume resizes: namely, resizes of PersistentVolumes that take place
independently of any node, referred to as Controller Expansion. In that case, you
associate a PersistentVolume with a Secret that contains credentials for volume resize
actions, so that controller expansion can take place. CSI also supports a nodeExpandVolume
operation, which CSI drivers can use independently of, or along with, Controller
Expansion; here the resize is driven from a node in your cluster where
the volume is attached. Please read Kubernetes 1.24: Volume Expansion Now A Stable Feature.
At times, the CSI driver needs to check the actual size of the backend block storage (or image)
before proceeding with a node-level filesystem expand operation. This avoids false positive returns
from the backend storage cluster during filesystem expands.
When a PersistentVolume represents encrypted block storage (for example using LUKS)
you need to provide a passphrase in order to expand the device, and also to make it possible
to grow the filesystem on that device.
For various validations at the time of node expansion, the CSI driver has to be connected
to the backend storage cluster. If the nodeExpandVolume request includes a secretRef,
the CSI driver can use those credentials to connect to the storage cluster and
perform the required cluster operations.
How does it work?
To enable this functionality in this version of Kubernetes, SIG Storage has introduced
a new feature gate called CSINodeExpandSecret. Once the feature gate is enabled
in the cluster, NodeExpandVolume requests can include a secretRef field. The NodeExpandVolume
request is part of CSI; it is sent from the Kubernetes control plane to the CSI driver.
As a cluster operator (or admin), you can specify these secrets as opaque parameters in a StorageClass,
the same way that you can already specify other CSI secret data. The StorageClass needs to have some
CSI-specific parameters set. Here's an example of those parameters:
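A minimal sketch of those parameters (the Secret name and namespace here are illustrative and must match a Secret you create):

```yaml
parameters:
  csi.storage.k8s.io/node-expand-secret-name: test-secret
  csi.storage.k8s.io/node-expand-secret-namespace: default
```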
If the feature gate is enabled and the StorageClass carries the above secret configuration,
the CSI provisioner receives the credentials from the Secret as part of the NodeExpansion request.
CSI volumes that require secrets for online expansion will have the nodeExpandSecretRef
field set. If it is not set, the NodeExpandVolume CSI RPC call is made without a secret.
Trying it out
Enable the CSINodeExpandSecret feature gate (please refer to
Feature Gates).
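As a sketch, the alpha gate can be enabled with the standard --feature-gates flag on both components (all other flags are omitted here):

```shell
kube-apiserver --feature-gates=CSINodeExpandSecret=true
kubelet --feature-gates=CSINodeExpandSecret=true
```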
Create a Secret, and then a StorageClass that uses that Secret.
Here's an example manifest for a Secret that holds credentials:
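A minimal sketch of such a Secret; the credential keys and values are placeholders (what your CSI driver actually expects may differ), while the name and namespace match the StorageClass parameters shown below:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: test-secret        # referenced by csi.storage.k8s.io/node-expand-secret-name
  namespace: default       # referenced by csi.storage.k8s.io/node-expand-secret-namespace
stringData:
  username: admin          # placeholder credential
  password: t0p-Secret     # placeholder credential
```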
Here's an example manifest for a StorageClass that refers to those credentials:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-blockstorage-sc
parameters:
  csi.storage.k8s.io/node-expand-secret-name: test-secret # the name of the Secret
  csi.storage.k8s.io/node-expand-secret-namespace: default # the namespace that the Secret is in
provisioner: blockstorage.cloudprovider.example
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
```
Example output
If the PersistentVolumeClaim (PVC) was created successfully, you can see that
configuration within the spec.csi field of the PersistentVolume (look for
spec.csi.nodeExpandSecretRef).
Check that it worked by running kubectl get persistentvolume <pv_name> -o yaml.
You should see something like the following.
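Here is a trimmed sketch of the relevant part of that output, assuming the example StorageClass above provisioned the volume (all other PersistentVolume fields are omitted):

```yaml
spec:
  csi:
    driver: blockstorage.cloudprovider.example
    nodeExpandSecretRef:
      name: test-secret
      namespace: default
```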
If you then trigger online storage expansion, the kubelet passes the appropriate credentials
to the CSI driver, by loading that Secret and passing the data to the storage driver.
As this feature is still in alpha, the Kubernetes Storage SIG expects to gather feedback from CSI driver
authors and to add more tests and implementation work. The community plans to eventually
promote the feature to beta in upcoming releases.
Get involved or learn more?
The enhancement proposal includes lots of detail about the history and technical
implementation of this feature.
Please get involved by joining the Kubernetes
Storage SIG
(Special Interest Group) to help us enhance this feature.
There are a lot of good ideas already and we'd be thrilled to have more!
Local ephemeral storage capacity isolation was introduced as an alpha feature in Kubernetes 1.7 and went beta in 1.9. With Kubernetes 1.25 we are excited to announce general availability (GA) of this feature.
Pods use ephemeral local storage for scratch space, caching, and logs. The lifetime of local ephemeral storage does not extend beyond the life of the individual pod. It is exposed to pods using the container’s writable layer, logs directory, and EmptyDir volumes. Before this feature was introduced, there were issues related to the lack of local storage accounting and isolation, such as Pods not knowing how much local storage is available and being unable to request guaranteed local storage. Local storage is a best-effort resource and pods can be evicted due to other pods filling the local storage.
The local storage capacity isolation feature allows users to manage local ephemeral storage in the same way as managing CPU and memory. It provides support for capacity isolation of shared storage between pods, such that a pod can be hard limited in its consumption of shared resources by evicting Pods if its consumption of shared storage exceeds that limit. It also allows setting ephemeral storage requests for resource reservation. The limits and requests for shared ephemeral-storage are similar to those for memory and CPU consumption.
How to use local storage capacity isolation
A typical configuration for local ephemeral storage is to place all different kinds of ephemeral local data (emptyDir volumes, writeable layers, container images, logs) into one filesystem. Typically, both /var/lib/kubelet and /var/log are on the system's root filesystem. If users configure the local storage in different ways, kubelet might not be able to correctly measure disk usage and use this feature.
Setting requests and limits for local ephemeral storage
You can specify ephemeral-storage for managing local ephemeral storage. Each container of a Pod can specify either or both of an ephemeral-storage request and an ephemeral-storage limit.
In the following example, the Pod has two containers. The first container has a request of 8GiB of local ephemeral storage and a limit of 12GiB. The second container requests 2GiB of local storage, but no limit setting. Therefore, the Pod requests a total of 10GiB (8GiB+2GiB) of local ephemeral storage and enforces a limit of 12GiB of local ephemeral storage. It also sets emptyDir sizeLimit to 5GiB. With this setting in pod spec, it will affect how the scheduler makes a decision on scheduling pods and also how kubelet evict pods.
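The example described above could be written as the following Pod manifest (names and images are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend                 # illustrative name
spec:
  containers:
  - name: app                    # first container: request 8Gi, limit 12Gi
    image: images.my-company.example/app:v4           # placeholder image
    resources:
      requests:
        ephemeral-storage: "8Gi"
      limits:
        ephemeral-storage: "12Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: "/tmp"
  - name: log-aggregator         # second container: request 2Gi, no limit
    image: images.my-company.example/log-aggregator:v6  # placeholder image
    resources:
      requests:
        ephemeral-storage: "2Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: "/tmp"
  volumes:
  - name: ephemeral
    emptyDir:
      sizeLimit: 5Gi             # pod is evicted if emptyDir usage exceeds this
```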
First of all, the scheduler ensures that the sum of the resource requests of the scheduled containers is less than the capacity of the node. In this case, the pod can be assigned to a node only if its available ephemeral storage (allocatable resource) has more than 10GiB.
Secondly, at the container level, since the first container sets a resource limit, the kubelet eviction manager will measure the disk usage of that container and evict the Pod if the storage usage of the first container exceeds its limit (12GiB). At the pod level, the kubelet works out an overall Pod storage limit by
adding up the limits of all the containers in that Pod. In this case, the total storage usage at pod level is the sum of the disk usage from all containers plus the Pod's emptyDir volumes. If this total usage exceeds the overall Pod storage limit (12GiB), then the kubelet also marks the Pod for eviction.
Last, in this example, the emptyDir volume sets its sizeLimit to 5Gi. This means that if this Pod's emptyDir uses up more local storage than 5GiB, the Pod will be evicted from the node.
Setting resource quota and limitRange for local ephemeral storage
This feature adds two more resource quotas for storage. The request and limit quotas set constraints on the total ephemeral-storage requests and limits of all containers in a namespace.
Similar to CPU and memory, an administrator can use a LimitRange to set a default container-level local storage request/limit, and/or minimum/maximum resource constraints for a namespace.
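A sketch of such a LimitRange (the name and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limits   # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:
      ephemeral-storage: 500Mi     # applied when a container sets no request
    default:
      ephemeral-storage: 1Gi       # applied when a container sets no limit
    min:
      ephemeral-storage: 100Mi
    max:
      ephemeral-storage: 10Gi
```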
Also, ephemeral-storage may be reserved for the kubelet or the system. For example: --system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=10Gi][,][pid=1000] --kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=5Gi][,][pid=1000]. If your cluster node's root disk capacity is 100Gi, then after setting these system-reserved and kube-reserved values, the allocatable ephemeral storage becomes 85Gi. The scheduler will use this information to assign pods based on requests and the allocatable resources of each node. The eviction manager will also use allocatable resources to determine pod eviction. See more details in Reserve Compute Resources for System Daemons.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together.
We offer a huge thank you to all the contributors in Kubernetes Storage SIG and CSI community who helped review the design and implementation of the project, including but not limited to the following:
Authors: Ravi Gudimetla (Apple), Filip Křepinský (Red Hat), Maciej Szulik (Red Hat)
This blog describes two features, minReadySeconds for StatefulSets and maxSurge for DaemonSets, that SIG Apps is happy to graduate to stable in Kubernetes 1.25.
Specifying minReadySeconds slows down a rollout of a StatefulSet, when using a RollingUpdate value in .spec.updateStrategy field, by waiting for each pod for a desired time.
This time can be used for initializing the pod (e.g. warming up the cache) or as a delay before acknowledging the pod.
maxSurge allows a DaemonSet workload to run multiple instances of the same pod on a node during a rollout when using a RollingUpdate value in .spec.updateStrategy field.
This helps to minimize the downtime of the DaemonSet for consumers.
These features were already available in a Deployment and other workloads. This graduation helps to align this functionality across the workloads.
What problems do these features solve?
minReadySeconds for StatefulSets
minReadySeconds ensures that a StatefulSet pod is Ready for the given number of seconds before reporting the
pod as Available. The notion of being Ready and Available is quite important for workloads. For example, some workloads, like Prometheus with multiple instances of Alertmanager, should be considered Available only when the Alertmanager's state transfer is complete. minReadySeconds also helps when using load balancers with cloud providers. Since the pod must be Ready for the given number of seconds, it provides buffer time to prevent pods in rotation from being killed before new pods show up.
maxSurge for DaemonSets
Kubernetes system-level components like CNI and CSI are typically run as DaemonSets. These components can impact the availability of workloads if those DaemonSets go down momentarily during upgrades. This feature allows the number of DaemonSet pods to temporarily increase, thereby ensuring zero downtime for the DaemonSets.
Please note that the usage of hostPort in conjunction with maxSurge in DaemonSets is not allowed as DaemonSet pods are tied to a single node and two active pods cannot share the same port on the same node.
How does it work?
minReadySeconds for StatefulSets
The StatefulSet controller watches the StatefulSet's pods and counts how long a particular pod has been in the Running state; if this value is greater than or equal to the time specified in the .spec.minReadySeconds field of the StatefulSet, the StatefulSet controller updates the AvailableReplicas field in the StatefulSet's status.
maxSurge for DaemonSets
The DaemonSet controller creates additional pods (above the desired number resulting from the DaemonSet spec) based on the value given in .spec.updateStrategy.rollingUpdate.maxSurge. The additional pods run on the same node as the old DaemonSet pod until the old pod gets killed.
The default value is 0.
The value cannot be 0 when MaxUnavailable is 0.
The value can be specified either as an absolute number of pods, or a percentage (rounded up) of desired pods.
How do I use it?
minReadySeconds for StatefulSets
Specify a value for minReadySeconds for any StatefulSet and check if pods are available or not by inspecting
AvailableReplicas field using:
kubectl get statefulset/<name_of_the_statefulset> -o yaml
Please note that the default value of minReadySeconds is 0.
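As a sketch, a StatefulSet with minReadySeconds set might look like this (the name, image, and value of 10 seconds are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                  # illustrative name
spec:
  serviceName: web
  replicas: 3
  minReadySeconds: 10        # each pod must stay Ready 10s before counting as Available
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx          # illustrative container
        image: registry.k8s.io/nginx-slim:0.8   # illustrative image
```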
maxSurge for DaemonSets
Specify a value for .spec.updateStrategy.rollingUpdate.maxSurge and set .spec.updateStrategy.rollingUpdate.maxUnavailable to 0.
Then observe a faster rollout and higher number of pods running at the same time in the next rollout.
kubectl rollout restart daemonset <name_of_the_daemonset>
kubectl get pods -w
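Putting it together, a DaemonSet using this rollout strategy might look like the following sketch (the name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: logging-agent        # illustrative name
spec:
  selector:
    matchLabels:
      app: logging-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # allow one extra pod per node during the rollout
      maxUnavailable: 0      # never go below the desired count
  template:
    metadata:
      labels:
        app: logging-agent
    spec:
      containers:
      - name: agent
        image: agent.example/logging-agent:1.0   # placeholder image
```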
Kubernetes 1.25 introduces Alpha support for a new kubelet-managed pod condition
in the status field of a pod: PodHasNetwork. The kubelet, for a worker node,
will use the PodHasNetwork condition to accurately surface the initialization
state of a pod from the perspective of pod sandbox creation and network
configuration by a container runtime (typically in coordination with CNI
plugins). The kubelet starts pulling container images and starting individual
containers (including init containers) only after the status of the PodHasNetwork
condition is set to True. Metrics collection services that report latency of
pod initialization from a cluster infrastructural perspective (i.e. agnostic of
per container characteristics like image size or payload) can utilize the
PodHasNetwork condition to accurately generate Service Level Indicators
(SLIs). Certain operators or controllers that manage underlying pods may utilize
the PodHasNetwork condition to optimize the set of actions performed when pods
repeatedly fail to come up.
How is this different from the existing Initialized condition reported for pods?
The kubelet sets the status of the existing Initialized condition reported in
the status field of a pod depending on the presence of init containers in a pod.
If a pod specifies init containers, the status of the Initialized condition in
the pod status will not be set to True until all init containers for the pod
have succeeded. However, init containers, configured by users, may have errors
(payload crashing, invalid image, etc) and the number of init containers
configured in a pod may vary across different workloads. Therefore,
cluster-wide, infrastructural SLIs around pod initialization cannot depend on
the Initialized condition of pods.
If a pod does not specify init containers, the status of the Initialized
condition in the pod status is set to True very early in the lifecycle of the
pod. This occurs before the kubelet initiates any pod runtime sandbox creation
and network configuration steps. As a result, a pod without init containers will
report the status of the Initialized condition as True even if the container
runtime is not able to successfully initialize the pod sandbox environment.
Relative to either situation above, the PodHasNetwork condition surfaces more
accurate data around when the pod runtime sandbox was initialized with
networking configured so that the kubelet can proceed to launch user-configured
containers (including init containers) in the pod.
Note that a node agent may dynamically re-configure network interface(s) for a
pod by watching changes in pod annotations that specify additional networking
configuration (e.g. k8s.v1.cni.cncf.io/networks). Dynamic updates of pod
networking configuration after the pod sandbox is initialized by Kubelet (in
coordination with a container runtime) are not reflected by the PodHasNetwork
condition.
Try out the PodHasNetwork condition for pods
In order to have the kubelet report the PodHasNetwork condition in the status
field of a pod, please enable the PodHasNetworkCondition feature gate on the
kubelet.
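One way to do that is through the kubelet configuration file; a minimal sketch showing only the relevant field:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  PodHasNetworkCondition: true
```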
For a pod whose runtime sandbox has been successfully created and has networking
configured, the kubelet will report the PodHasNetwork condition with status set to True:
```console
$ kubectl describe pod nginx1
Name:         nginx1
Namespace:    default
...
Conditions:
  Type              Status
  PodHasNetwork     True
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
```
For a pod whose runtime sandbox has not been created yet (and networking not
configured either), the kubelet will report the PodHasNetwork condition with
status set to False:
```console
$ kubectl describe pod nginx2
Name:         nginx2
Namespace:    default
...
Conditions:
  Type              Status
  PodHasNetwork     False
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
```
What’s next?
Depending on feedback and adoption, the Kubernetes team plans to push the
reporting of the PodHasNetwork condition to Beta in 1.26 or 1.27.
How can I learn more?
Please check out the
documentation for the
PodHasNetwork condition to learn more about it and how it fits in relation to
other pod conditions.
How to get involved?
This feature is driven by the SIG Node community. Please join us to connect with
the community and share your ideas and feedback around the above feature and
beyond. We look forward to hearing from you!
Acknowledgements
We want to thank the following people for their insightful and helpful reviews
of the KEP and PRs around this feature: Derek Carr (@derekwaynecarr), Mrunal
Patel (@mrunalp), Dawn Chen (@dchen1107), Qiutong Song (@qiutongs), Ruiwen Zhao
(@ruiwen-zhao), Tim Bannister (@sftim), Danielle Lancashire (@endocrimes) and
Agam Dua (@agamdua).
A long-standing request from the Kubernetes community has been to have a
programmatic way for end users to keep track of Kubernetes security issues
(also called "CVEs", after the database that tracks public security issues across
different products and vendors). Accompanying the release of Kubernetes v1.25,
we are excited to announce availability of such
a feed as an alpha
feature. This blog will cover the background and scope of this new service.
Motivation
With the growing number of eyes on Kubernetes, the number of CVEs related to
Kubernetes has increased. Although most CVEs that directly, indirectly, or
transitively impact Kubernetes are regularly fixed, there is no single place for
the end users of Kubernetes to programmatically subscribe or pull the data of
fixed CVEs. Current options are either broken or incomplete.
Scope
What This Does
Create a periodically auto-refreshing, human and machine-readable list of
official Kubernetes CVEs
What This Doesn't Do
Triage and vulnerability disclosure will continue to be done by SRC (Security
Response Committee).
Listing CVEs that are identified in build-time dependencies and container
images is out of scope.
Only official CVEs announced by the Kubernetes SRC will be published in the
feed.
Who It's For
End Users: Persons or teams who use Kubernetes to deploy applications
they own
Platform Providers: Persons or teams who manage Kubernetes clusters
Maintainers: Persons or teams who create and support Kubernetes
releases through their work in Kubernetes Community - via various Special
Interest Groups and Committees.
What's Next?
In order to graduate this feature, SIG Security
is gathering feedback from end users who are using this alpha feed.
So in order to improve the feed in future Kubernetes Releases, if you have any
feedback, please let us know by adding a comment to
this tracking issue or
let us know on
#sig-security-tooling
Kubernetes Slack channel.
(Join Kubernetes Slack here)
A special shout out and massive thanks to Neha Lohia
(@nehalohia27) and Tim
Bannister (@sftim) for their stellar collaboration
for many months from "ideation to implementation" of this feature.
Authors: Anish Ramasekar, Rita Zhang, Mo Khan, and Xander Grzywinski (Microsoft)
With Kubernetes v1.25, SIG Auth is introducing a new v2alpha1 version of the Key Management Service (KMS) API. There are a lot of improvements in the works, and we're excited to be able to start down the path of a new and improved KMS!
What is KMS?
One of the first things to consider when securing a Kubernetes cluster is encrypting persisted API data at rest. KMS provides an interface for a provider to utilize a key stored in an external key service to perform this encryption.
Encryption at rest using KMS v1 has been a feature of Kubernetes since version v1.10, and has been in beta since version v1.12.
What’s new in v2alpha1?
While the original v1 implementation has been successful in helping Kubernetes users encrypt etcd data, it did fall short in a few key ways:
Performance: When starting a cluster, all resources are serially fetched and decrypted to fill the kube-apiserver cache. When using a KMS plugin, this can cause slow startup times due to the large number of requests made to the remote vault. In addition, there is the potential to hit API rate limits on external key services depending on how many encrypted resources exist in the cluster.
Key Rotation: With KMS v1, rotation of a key-encrypting key is a manual and error-prone process. It can be difficult to determine what encryption keys are in-use on a cluster.
Health Check & Status: Before the KMS v2 API, the kube-apiserver was forced to make encrypt and decrypt calls as a proxy to determine if the KMS plugin is healthy. With cloud services, these operations usually cost actual money. Whatever the cost, those operations on their own do not provide a holistic view of the service's health.
Observability: Without some kind of trace ID, it has been difficult to correlate events found in the various logs across the kube-apiserver, KMS, and KMS plugins.
The KMS v2 enhancement attempts to address all of these shortcomings, though not all planned features are implemented in the initial alpha release. Here are the improvements that arrived in Kubernetes v1.25:
Extra metadata is now tracked to allow a KMS plugin to communicate what key it is currently using with the kube-apiserver, allowing for rotation without API server restart. Data stored in etcd follows a more standard proto format to allow external tools to observe its state. To learn more, check out the details for metadata.
A dedicated status API is used to communicate the health of the KMS plugin with the API server. To learn more, check out the details for status API.
To improve observability, a new UID field is included in EncryptRequest and DecryptRequest of the v2 API. The UID is generated for each envelope operation. To learn more, check out the details for observability.
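To make this concrete, here is a sketch of how a cluster operator might point the kube-apiserver at a KMS v2 plugin in its EncryptionConfiguration. The plugin name and socket path are hypothetical placeholders; the `apiVersion: v2` field is what selects the new (alpha) API.

```shell
# Sketch: EncryptionConfiguration using the KMS v2 (alpha) provider.
# "my-kms-plugin" and the socket path are hypothetical placeholders.
cat > /tmp/encryption-config.yaml <<'EOF'
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2          # selects the v2alpha1 KMS API
          name: my-kms-plugin
          endpoint: unix:///var/run/kms/my-kms-plugin.sock
      - identity: {}              # fallback for reading still-unencrypted data
EOF
# Pass to the API server with:
#   kube-apiserver --encryption-provider-config=/tmp/encryption-config.yaml ...
```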
Sequence Diagram
Encrypt Request
Decrypt Request
What’s next?
For Kubernetes v1.26, we expect to ship another alpha version. As of right now, the alpha API will be ready to be used by KMS plugin authors. We hope to include a reference plugin implementation with the next release, and you'll be able to try out the feature at that time.
You can learn more about KMS v2 by reading Using a KMS provider for data encryption. You can also follow along on the KEP to track progress across the coming Kubernetes releases.
How to get involved
If you are interested in getting involved in the development of this feature or would like to share feedback, please reach out on the #sig-auth-kms-dev channel on Kubernetes Slack.
You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.
Acknowledgements
This feature has been an effort driven by contributors from several different companies. We would like to extend a huge thank you to everyone that contributed their time and effort to help make this possible.
Some Kubernetes components (such as kubelet and kube-proxy) create
iptables chains and rules as part of their operation. These chains
were never intended to be part of any Kubernetes API/ABI guarantees,
but some external components nonetheless make use of some of them (in
particular, using KUBE-MARK-MASQ to mark packets as needing to be
masqueraded).
As a part of the v1.25 release, SIG Network made this declaration
explicit: that (with one exception), the iptables chains that
Kubernetes creates are intended only for Kubernetes’s own internal
use, and third-party components should not assume that Kubernetes will
create any specific iptables chains, or that those chains will contain
any specific rules if they do exist.
Then, in future releases, as part of KEP-3178, we will begin phasing
out certain chains that Kubernetes itself no longer needs. Components
outside of Kubernetes itself that make use of KUBE-MARK-MASQ,
KUBE-MARK-DROP, or other Kubernetes-generated iptables chains should
start migrating away from them now.
Background
In addition to various service-specific iptables chains, kube-proxy
creates certain general-purpose iptables chains that it uses as part
of service proxying. In the past, kubelet also used iptables for a few
features (such as setting up hostPort mapping for pods) and so it
also redundantly created some of the same chains.
However, with the removal of dockershim in Kubernetes 1.24,
kubelet now no longer ever uses any iptables rules for its own
purposes; the things that it used to use iptables for are now always
the responsibility of the container runtime or the network plugin, and
there is no reason for kubelet to be creating any iptables rules.
Meanwhile, although iptables is still the default kube-proxy backend
on Linux, it is unlikely to remain the default forever, since the
associated command-line tools and kernel APIs are essentially
deprecated, and no longer receiving improvements. (RHEL 9
logs a warning if you use the iptables API, even via
iptables-nft.)
Although as of Kubernetes 1.25 iptables kube-proxy remains popular,
and kubelet continues to create the iptables rules that it
historically created (despite no longer using them), third party
software cannot assume that core Kubernetes components will keep
creating these rules in the future.
Upcoming changes
Starting a few releases from now, kubelet will no longer create the
following iptables chains in the nat table:
KUBE-MARK-DROP
KUBE-MARK-MASQ
KUBE-POSTROUTING
Additionally, the KUBE-FIREWALL chain in the filter table will no
longer have the functionality currently associated with
KUBE-MARK-DROP (and it may eventually go away entirely).
This change will be phased in via the IPTablesOwnershipCleanup
feature gate. That feature gate is available and can be manually
enabled for testing in Kubernetes 1.25. The current plan is that it
will become enabled-by-default in Kubernetes 1.27, though this may be
delayed to a later release. (It will not happen sooner than Kubernetes
1.27.)
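For example, one way to try the gate early is through the kubelet configuration file (the /tmp path here is for illustration; a real node would use its actual kubelet config file, whose location varies by distribution and installer, and kube-proxy has an analogous featureGates field):

```shell
# Sketch: enable the IPTablesOwnershipCleanup feature gate for testing.
cat > /tmp/kubelet-iptables-gate.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  IPTablesOwnershipCleanup: true
EOF
```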
What to do if you use Kubernetes’s iptables chains
(Although the discussion below focuses on short-term fixes that are
still based on iptables, you should probably also start thinking about
eventually migrating to nftables or another API).
If you use KUBE-MARK-MASQ...
If you are making use of the KUBE-MARK-MASQ chain to cause packets
to be masqueraded, you have two options: (1) rewrite your rules to use
-j MASQUERADE directly, (2) create your own alternative “mark for
masquerade” chain.
The reason kube-proxy uses KUBE-MARK-MASQ is because there are lots
of cases where it needs to call both -j DNAT and -j MASQUERADE on
a packet, but it’s not possible to do both of those at the same time
in iptables; DNAT must be called from the PREROUTING (or OUTPUT)
chain (because it potentially changes where the packet will be routed
to) while MASQUERADE must be called from POSTROUTING (because the
masqueraded source IP that it picks depends on what the final routing
decision was).
In theory, kube-proxy could have one set of rules to match packets in
PREROUTING/OUTPUT and call -j DNAT, and then have a second set
of rules to match the same packets in POSTROUTING and call -j MASQUERADE. But instead, for efficiency, it only matches them once,
during PREROUTING/OUTPUT, at which point it calls -j DNAT and
then calls -j KUBE-MARK-MASQ to set a bit on the kernel packet mark
as a reminder to itself. Then later, during POSTROUTING, it has a
single rule that matches all previously-marked packets, and calls -j MASQUERADE on them.
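The pattern described above can be sketched in iptables-restore format. The chain name, mark bit, and addresses below are illustrative placeholders, not kube-proxy's actual values:

```shell
# Sketch of the "mark now, masquerade later" pattern (illustrative values only).
cat > /tmp/mark-masq-pattern.txt <<'EOF'
*nat
:MY-MARK-MASQ - [0:0]
# The mark chain just sets a bit on the kernel packet mark.
-A MY-MARK-MASQ -j MARK --set-xmark 0x8000/0x8000
# PREROUTING: mark the packet, then DNAT it (DNAT ends nat processing).
-A PREROUTING -d 10.0.0.10/32 -p tcp --dport 80 -j MY-MARK-MASQ
-A PREROUTING -d 10.0.0.10/32 -p tcp --dport 80 -j DNAT --to-destination 192.168.1.5:8080
# POSTROUTING: one rule catches everything marked earlier.
-A POSTROUTING -m mark --mark 0x8000/0x8000 -j MASQUERADE
COMMIT
EOF
# Load (as root) with: iptables-restore --noflush < /tmp/mark-masq-pattern.txt
```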
If you have a lot of rules where you need to apply both DNAT and
masquerading to the same packets like kube-proxy does, then you may
want a similar arrangement. But in many cases, components that use
KUBE-MARK-MASQ are only doing it because they copied kube-proxy’s
behavior without understanding why kube-proxy was doing it that way.
Many of these components could easily be rewritten to just use
separate DNAT and masquerade rules. (In cases where no DNAT is
occurring then there is even less point to using KUBE-MARK-MASQ;
just move your rules from PREROUTING to POSTROUTING and call -j MASQUERADE directly.)
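For the simple no-DNAT case, the rewrite might look like this sketch (the subnet is a placeholder): instead of jumping to KUBE-MARK-MASQ from PREROUTING, match the same traffic in POSTROUTING and masquerade directly.

```shell
# Sketch: replace a KUBE-MARK-MASQ dependency with a direct MASQUERADE rule.
cat > /tmp/direct-masq.txt <<'EOF'
*nat
# Before (relied on a Kubernetes-owned chain):
#   -A PREROUTING -s 172.30.0.0/16 -j KUBE-MARK-MASQ
# After (self-contained, in POSTROUTING where MASQUERADE is valid):
-A POSTROUTING -s 172.30.0.0/16 ! -d 172.30.0.0/16 -j MASQUERADE
COMMIT
EOF
# Load (as root) with: iptables-restore --noflush < /tmp/direct-masq.txt
```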
If you use KUBE-MARK-DROP...
The rationale for KUBE-MARK-DROP is similar to the rationale for
KUBE-MARK-MASQ: kube-proxy wanted to make packet-dropping decisions
alongside other decisions in the KUBE-SERVICES chain in the nat table, but you
can only call -j DROP from the filter table. So instead, it uses
KUBE-MARK-DROP to mark packets to be dropped later on.
In general, the approach for removing a dependency on KUBE-MARK-DROP
is the same as for removing a dependency on KUBE-MARK-MASQ. In
kube-proxy’s case, it is actually quite easy to replace the usage of
KUBE-MARK-DROP in the nat table with direct calls to DROP in the
filter table, because there are no complicated interactions between
DNAT rules and drop rules, and so the drop rules can simply be moved
from nat to filter.
In more complicated cases, it might be necessary to “re-match” the
same packets in both nat and filter.
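As a sketch of the simple case, a nat-table rule that jumped to KUBE-MARK-DROP can become a direct DROP in the filter table (the chain and port match here are illustrative placeholders):

```shell
# Sketch: replace a KUBE-MARK-DROP dependency with a direct filter-table DROP.
cat > /tmp/direct-drop.txt <<'EOF'
*filter
# Before (nat table, relied on a Kubernetes-owned chain):
#   -A MY-CHAIN -p tcp --dport 30080 -j KUBE-MARK-DROP
# After (filter table, where -j DROP is allowed):
-A INPUT -p tcp --dport 30080 -j DROP
COMMIT
EOF
# Load (as root) with: iptables-restore --noflush < /tmp/direct-drop.txt
```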
If you use Kubelet’s iptables rules to figure out iptables-legacy vs iptables-nft...
Components that manipulate host-network-namespace iptables rules from
inside a container need some way to figure out whether the host is
using the old iptables-legacy binaries or the newer iptables-nft
binaries (which talk to a different kernel API underneath).
The iptables-wrappers module provides a way for such components to
autodetect the system iptables mode, but in the past it did this by
assuming that Kubelet will have created “a bunch” of iptables rules
before any containers start, and so it can guess which mode the
iptables binaries in the host filesystem are using by seeing which
mode has more rules defined.
In future releases, Kubelet will no longer create many iptables rules,
so heuristics based on counting the number of rules present may fail.
However, as of 1.24, Kubelet always creates a chain named
KUBE-IPTABLES-HINT in the mangle table of whichever iptables
subsystem it is using. Components can now look for this specific chain
to know which iptables subsystem Kubelet (and thus, presumably, the
rest of the system) is using.
(Additionally, since Kubernetes 1.17, kubelet has created a chain
called KUBE-KUBELET-CANARY in the mangle table. While this chain
may go away in the future, it will of course still be there in older
releases, so in any recent version of Kubernetes, at least one of
KUBE-IPTABLES-HINT or KUBE-KUBELET-CANARY will be present.)
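A minimal detection function along these lines might look like the following sketch. It assumes the iptables-legacy and iptables-nft binaries are on the PATH, and prints "unknown" when neither hint chain can be found (for example, when running without sufficient privileges):

```shell
# Sketch: guess the host iptables mode from kubelet's hint chains.
detect_iptables_mode() {
    for mode in nft legacy; do
        # KUBE-IPTABLES-HINT exists on kubelet >= 1.24; fall back to the
        # older KUBE-KUBELET-CANARY chain for earlier releases.
        for chain in KUBE-IPTABLES-HINT KUBE-KUBELET-CANARY; do
            if "iptables-${mode}" -t mangle -nL "${chain}" >/dev/null 2>&1; then
                echo "${mode}"
                return 0
            fi
        done
    done
    echo "unknown"
}
detect_iptables_mode
```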
The iptables-wrappers package has already been updated with this new
heuristic, so if you were previously using that, you can rebuild your
container images with an updated version of that.
Further reading
The project to clean up iptables chain ownership and deprecate the old
chains is tracked by KEP-3178.
This article introduces the Container Object Storage Interface (COSI), a standard for provisioning and consuming object storage in Kubernetes. It is an alpha feature in Kubernetes v1.25.
File and block storage are treated as first class citizens in the Kubernetes ecosystem via Container Storage Interface (CSI). Workloads using CSI volumes enjoy the benefits of portability across vendors and across Kubernetes clusters without the need to change application manifests. An equivalent standard does not exist for Object storage.
Object storage has been rising in popularity in recent years as an alternative form of storage to filesystems and block devices. The object storage paradigm promotes disaggregation of compute and storage. This is done by making data available over the network, rather than locally. Disaggregated architectures allow compute workloads to be stateless, which consequently makes them easier to manage, scale and automate.
COSI
COSI aims to standardize consumption of object storage to provide the following benefits:
Kubernetes Native - Use the Kubernetes API to provision, configure and manage buckets
Self Service - A clear delineation between administration and operations (DevOps) to enable self-service capability for DevOps personnel
Portability - Vendor neutrality enabled through portability across Kubernetes Clusters and across Object Storage vendors
Portability across vendors is only possible when both vendors support a common datapath API. For example, it is possible to port from AWS S3 to Ceph, or from AWS S3 to MinIO and back, as they all use the S3 API. In contrast, it is not possible to port from AWS S3 to Google Cloud's GCS or vice versa.
Architecture
COSI is made up of three components:
COSI Controller Manager
COSI Sidecar
COSI Driver
The COSI Controller Manager acts as the main controller that processes changes to COSI API objects. It is responsible for fielding requests for bucket creation, updates, deletion and access management. One instance of the controller manager is required per Kubernetes cluster. Only one is needed even if multiple object storage providers are used in the cluster.
The COSI Sidecar acts as a translator between COSI API requests and vendor-specific COSI Drivers. This component uses a standardized gRPC protocol that vendor drivers are expected to satisfy.
The COSI Driver is the vendor specific component that receives requests from the sidecar and calls the appropriate vendor APIs to create buckets, manage their lifecycle and manage access to them.
API
The COSI API is centered around buckets, since the bucket is the unit of abstraction for object storage. COSI defines three Kubernetes APIs aimed at managing them:
Bucket
BucketClass
BucketClaim
In addition, two more APIs for managing access to buckets are also defined:
BucketAccess
BucketAccessClass
In a nutshell, Bucket and BucketClaim can be considered to be similar to PersistentVolume and PersistentVolumeClaim respectively. The BucketClass’ counterpart in the file/block device world is StorageClass.
Since object storage is always authenticated and accessed over the network, access credentials are required to access buckets. The two APIs, BucketAccess and BucketAccessClass, are used to denote access credentials and policies for authentication. More info about these APIs can be found in the official COSI proposal - https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1979-object-storage-support
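As a rough sketch of how these APIs fit together, a DevOps user might claim a bucket like this. The field names follow the alpha proposal and may change; the resource names are placeholders:

```shell
# Sketch: a BucketClaim referencing an admin-created BucketClass
# (objectstorage.k8s.io/v1alpha1 is an alpha API; fields may change).
cat > /tmp/bucketclaim.yaml <<'EOF'
apiVersion: objectstorage.k8s.io/v1alpha1
kind: BucketClaim
metadata:
  name: my-bucket-claim
spec:
  bucketClassName: my-bucket-class   # created by the admin
  protocols:
    - S3                             # datapath protocol the workload expects
EOF
# Apply with: kubectl apply -f /tmp/bucketclaim.yaml
```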
Self-Service
Beyond providing Kubernetes-API-driven bucket management, COSI also aims to empower DevOps personnel to provision and manage buckets on their own, without admin intervention. This further enables dev teams to realize faster turnaround times and faster time-to-market.
COSI achieves this by dividing bucket provisioning steps among two different stakeholders, namely the administrator (admin), and the cluster operator. The administrator will be responsible for setting broad policies and limits on how buckets are provisioned, and how access is obtained for them. The cluster operator will be free to create and utilize buckets within the limits set by the admin.
For example, an admin policy could restrict the maximum provisioned capacity to 100GB, and developers would be allowed to create buckets and store data up to that limit. Similarly for access credentials, admins would be able to restrict who can access which buckets, and developers would be able to access all the buckets available to them.
Portability
The third goal of COSI is to achieve vendor neutrality for bucket management. COSI enables two kinds of portability:
Cross Cluster
Cross Provider
Cross-cluster portability means allowing buckets provisioned in one cluster to be available in another cluster. This is only valid when the object storage backend itself is accessible from both clusters.
Cross-provider portability is about allowing organizations or teams to move from one object storage provider to another seamlessly, and without requiring changes to application definitions (PodTemplates, StatefulSets, Deployment and so on). This is only possible if the source and destination providers use the same datapath API.
COSI does not handle data migration as it is outside of its scope. In case porting between providers requires data to be migrated as well, then other measures need to be taken to ensure data availability.
What’s next
The amazing sig-storage-cosi community has worked hard to bring the COSI standard to alpha status. We are looking forward to onboarding a lot of vendors to write COSI drivers and become COSI compatible!
We want to add more authentication mechanisms for COSI buckets, we are designing advanced bucket sharing primitives, multi-cluster bucket management and much more. Lots of great ideas and opportunities ahead!
Stay tuned for what comes next, and if you have any questions, comments or suggestions, please reach out to the community!
Authors: David Porter (Google), Mrunal Patel (Red Hat)
Kubernetes 1.25 brings cgroup v2 to GA (general availability), letting the
kubelet use the latest container resource
management capabilities.
What are cgroups?
Effective resource management is a
critical aspect of Kubernetes. This involves managing the finite resources in
your nodes, such as CPU, memory, and storage.
cgroups are a Linux kernel capability that establish resource management
functionality like limiting CPU usage or setting memory limits for running
processes.
When you use the resource management capabilities in Kubernetes, such as configuring
requests and limits for Pods and containers,
Kubernetes uses cgroups to enforce your resource requests and limits.
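For instance, a Pod spec like the following sketch is translated by the kubelet and runtime into cgroup settings (the image and values are placeholders; the comments note which cgroup files typically back each setting):

```shell
# Sketch: Pod resource requests/limits and the cgroup files behind them.
cat > /tmp/pod-resources.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.8   # placeholder image
      resources:
        requests:
          cpu: 250m        # cpu.weight (v2) / cpu.shares (v1)
          memory: 64Mi
        limits:
          cpu: 500m        # cpu.max (v2) / cpu.cfs_quota_us (v1)
          memory: 128Mi    # memory.max (v2) / memory.limit_in_bytes (v1)
EOF
```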
The Linux kernel offers two versions of cgroups: cgroup v1 and cgroup v2.
What is cgroup v2?
cgroup v2 is the latest version of the Linux cgroup API. cgroup v2 provides a
unified control system with enhanced resource management capabilities.
cgroup v2 has been in development in the Linux Kernel since 2016 and in recent
years has matured across the container ecosystem. With Kubernetes 1.25, cgroup
v2 support has graduated to general availability.
Many recent releases of Linux distributions have switched over to cgroup v2 by
default so it's important that Kubernetes continues to work well on these new
updated distros.
cgroup v2 offers several improvements over cgroup v1, such as the following:
Enhanced resource allocation management and isolation across multiple resources
Unified accounting for different types of memory allocations (network and kernel memory, etc)
Accounting for non-immediate resource changes such as page cache write backs
Some Kubernetes features exclusively use cgroup v2 for enhanced resource
management and isolation. For example,
the MemoryQoS feature improves
memory utilization and relies on cgroup v2 functionality to enable it. New
resource management features in the kubelet will also take advantage of the new
cgroup v2 features moving forward.
How do you use cgroup v2?
Many Linux distributions are switching to cgroup v2 by default; you might start
using it the next time you update the Linux version of your control plane and
nodes!
Using a Linux distribution that uses cgroup v2 by default is the recommended
method. Some of the popular Linux distributions that use cgroup v2 include the
following:
Container Optimized OS (since M97)
Ubuntu (since 21.10)
Debian GNU/Linux (since Debian 11 Bullseye)
Fedora (since 31)
Arch Linux (since April 2021)
RHEL and RHEL-like distributions (since 9)
To check if your distribution uses cgroup v2 by default,
refer to Check your cgroup version or
consult your distribution's documentation.
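That check boils down to a single command on the node:

```shell
# Prints "cgroup2fs" when the host uses cgroup v2 (the unified hierarchy);
# on cgroup v1 hosts, /sys/fs/cgroup is typically a tmpfs.
stat -fc %T /sys/fs/cgroup/
```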
If you're using a managed Kubernetes offering, consult your provider to
determine how they're adopting cgroup v2, and whether you need to take action.
To use cgroup v2 with Kubernetes, you must meet the following requirements:
Your Linux distribution enables cgroup v2 on kernel version 5.8 or later
Your container runtime supports cgroup v2 (for example, containerd v1.4 and later, or CRI-O v1.20 and later)
The kubelet and the container runtime are configured to use the systemd cgroup driver
The kubelet and container runtime use a cgroup driver
to set cgroup parameters. When using cgroup v2, it's strongly recommended that both
the kubelet and your container runtime use the
systemd cgroup driver,
so that there's a single cgroup manager on the system. To configure the kubelet
and the container runtime to use the driver, refer to the
systemd cgroup driver documentation.
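A sketch of the two configuration snippets involved follows; files are written to /tmp here for illustration only, since the real paths depend on your installation (for example, /var/lib/kubelet/config.yaml and /etc/containerd/config.toml):

```shell
# kubelet side: select the systemd cgroup driver.
cat > /tmp/kubelet-cgroup-driver.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
EOF

# containerd side: matching runc option (snippet of containerd's config.toml).
cat > /tmp/containerd-snippet.toml <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
EOF
```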
Migrate to cgroup v2
When you run Kubernetes with a Linux distribution that enables cgroup v2, the
kubelet should automatically adapt without any additional configuration
required, as long as you meet the requirements.
In most cases, you won't see a difference in the user experience when you
switch to using cgroup v2 unless your users access the cgroup file system
directly.
If you have applications that access the cgroup file system directly, either on
the node or from inside a container, you must update the applications to use
the cgroup v2 API instead of the cgroup v1 API.
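As a small example of the kind of change involved, reading a memory limit differs between the two hierarchies. This sketch picks the right file based on the mounted filesystem type; note that memory.max may be absent at the root of the v2 hierarchy, so it falls back to "unknown":

```shell
# Sketch: read the memory limit of the current cgroup under either version.
memory_limit() {
    if [ "$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)" = "cgroup2fs" ]; then
        # cgroup v2: single unified file (may be absent at the root cgroup)
        cat /sys/fs/cgroup/memory.max 2>/dev/null || echo "unknown"
    else
        # cgroup v1: per-controller hierarchy
        cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo "unknown"
    fi
}
memory_limit
```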
Scenarios in which you might need to update to cgroup v2 include the following:
If you run third-party monitoring and security agents that depend on the cgroup file system, update the
agents to versions that support cgroup v2.
If you run cAdvisor as a stand-alone
DaemonSet for monitoring pods and containers, update it to v0.43.0 or later.
If you deploy Java applications with the JDK, prefer to use JDK 11.0.16 and
later or JDK 15 and later, which fully support cgroup v2.
Learn more about
cgroups on Linux Manual Pages
and cgroup v2 on the Linux Kernel documentation
Get involved
Your feedback is always welcome! SIG Node meets regularly, and its members are
available in the #sig-node channel on Kubernetes Slack, or
via the SIG mailing list.
cgroup v2 has had a long journey and is a great example of open source
community collaboration across the industry because it required work across the
stack, from the Linux Kernel to systemd to various container runtimes, and (of
course) Kubernetes.
Acknowledgments
We would like to thank Giuseppe Scrivano, who
initiated cgroup v2 support in Kubernetes, and the SIG Node community, including
chairs Dawn Chen and Derek Carr, for their
reviews and leadership.
We'd also like to thank the maintainers of container runtimes like Docker,
containerd and CRI-O, and the maintainers of components like
cAdvisor
and runc (with its libcontainer library),
which underpin many container runtimes. Finally, this wouldn't have been
possible without support from systemd and upstream Linux Kernel maintainers.
CSI Inline Volumes were introduced as an alpha feature in Kubernetes 1.15 and have been beta since 1.16. We are happy to announce that this feature has graduated to General Availability (GA) status in Kubernetes 1.25.
CSI Inline Volumes are similar to other ephemeral volume types, such as configMap, downwardAPI and secret. The important difference is that the storage is provided by a CSI driver, which allows the use of ephemeral storage provided by third-party vendors. The volume is defined as part of the pod spec and follows the lifecycle of the pod, meaning the volume is created once the pod is scheduled and destroyed when the pod is destroyed.
What's new in 1.25?
There are a couple of new bug fixes related to this feature in 1.25, and the CSIInlineVolume feature gate has been locked to True with the graduation to GA. There are no new API changes, so users of this feature during beta should not notice any significant changes aside from these bug fixes.
CSI inline volumes are meant for simple local volumes that should follow the lifecycle of the pod. They may be useful for providing secrets, configuration data, or other special-purpose storage to the pod from a CSI driver.
A CSI driver is not suitable for inline use when:
The volume needs to persist longer than the lifecycle of a pod
Volume snapshots, cloning, or volume expansion are required
The CSI driver requires volumeAttributes that should be restricted to an administrator
How to use this feature
In order to use this feature, the CSIDriver spec must explicitly list Ephemeral as one of the supported volumeLifecycleModes. Here is a simple example from the Secrets Store CSI Driver.
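A minimal sketch of such a CSIDriver object follows. The driver name matches the Secrets Store CSI Driver; the other fields shown are typical for inline-only drivers, but check the driver's own manifests for the authoritative version:

```shell
# Sketch: CSIDriver advertising support for inline (Ephemeral) volumes.
cat > /tmp/csidriver.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: secrets-store.csi.k8s.io
spec:
  podInfoOnMount: true
  attachRequired: false
  volumeLifecycleModes:
    - Ephemeral        # opts the driver in to inline use
EOF
# Apply with: kubectl apply -f /tmp/csidriver.yaml
```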
If the driver supports any volume attributes, you can provide these as part of the spec for the Pod as well:
csi:
  driver: block.csi.vendor.example
  volumeAttributes:
    foo: bar
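In context, a complete Pod manifest using that fragment could look like the following sketch (the driver name and attributes are the hypothetical ones above, and the image is a placeholder):

```shell
# Sketch: Pod with a CSI inline (ephemeral) volume; driver name is hypothetical.
cat > /tmp/pod-inline-csi.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: inline-csi-demo
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.8
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      csi:
        driver: block.csi.vendor.example
        volumeAttributes:
          foo: bar
EOF
# Apply with: kubectl apply -f /tmp/pod-inline-csi.yaml
```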
Example Use Cases
Two existing CSI drivers that support the Ephemeral volume lifecycle mode are the Secrets Store CSI Driver and the Cert-Manager CSI Driver.
The Secrets Store CSI Driver allows users to mount secrets from external secret stores into a pod as an inline volume. This can be useful when the secrets are stored in an external managed service or Vault instance.
The Cert-Manager CSI Driver works along with cert-manager to seamlessly request and mount certificate key pairs into a pod. This allows the certificates to be renewed and updated in the application pod automatically.
Security Considerations
Special consideration should be given to which CSI drivers may be used as inline volumes. volumeAttributes are typically controlled through the StorageClass, and may contain attributes that should remain restricted to the cluster administrator. Allowing a CSI driver to be used for inline ephemeral volumes means that any user with permission to create pods may also provide volumeAttributes to the driver through a pod spec.
Cluster administrators may choose to omit (or remove) Ephemeral from volumeLifecycleModes in the CSIDriver spec to prevent the driver from being used as an inline ephemeral volume, or use an admission webhook to restrict how the driver is used.
Authors: Tim Allclair (Google), Sam Stoelinga (Google)
The release of Kubernetes v1.25 marks a major milestone for Kubernetes out-of-the-box pod security
controls: Pod Security admission (PSA) graduated to stable, and Pod Security Policy (PSP) has been
removed.
PSP was deprecated in Kubernetes v1.21,
and no longer functions in Kubernetes v1.25 and later.
The Pod Security admission controller replaces PodSecurityPolicy, making it easier to enforce predefined
Pod Security Standards by
simply adding a label to a namespace. The Pod Security Standards are maintained by the K8s
community, which means you automatically get updated security policies whenever new
security-impacting Kubernetes features are introduced.
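For example, enforcement is configured entirely through namespace labels; the label keys below are the real Pod Security admission labels, while the namespace name is a placeholder:

```shell
# Sketch: a Namespace that enforces "baseline" and warns about "restricted".
cat > /tmp/ns-pod-security.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace          # placeholder
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
EOF
# Apply with: kubectl apply -f /tmp/ns-pod-security.yaml
```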
What’s new since Beta?
Pod Security Admission hasn’t changed much since the Beta in Kubernetes v1.23. The focus has been on
improving the user experience, while continuing to maintain a high quality bar.
Improved violation messages
We improved violation messages so that you get
fewer duplicate messages. For example,
instead of the following message when the Baseline and Restricted policies check the same
capability:
pods "admin-pod" is forbidden: violates PodSecurity "restricted:latest": non-default capabilities (container "admin" must not include "SYS_ADMIN" in securityContext.capabilities.add), unrestricted capabilities (container "admin" must not include "SYS_ADMIN" in securityContext.capabilities.add)
You get this message:
pods "admin-pod" is forbidden: violates PodSecurity "restricted:latest": unrestricted capabilities (container "admin" must not include "SYS_ADMIN" in securityContext.capabilities.add)
Improved namespace warnings
When you modify the enforce Pod Security labels on a namespace, the Pod Security
admission controller checks all existing pods for
violations and surfaces a warning if any are out of compliance. These
warnings are now aggregated for pods with
identical violations, making large namespaces with many replicas much more manageable.
Additionally, when you apply a non-privileged label to a namespace that has been
configured to be exempt,
you will now get a warning alerting you to this fact:
Warning: namespace 'kube-system' is exempt from Pod Security, and the policy (enforce=baseline:latest) will be ignored
Changes to the Pod Security Standards
The Pod Security Standards,
which Pod Security admission enforces, have been updated with support for the new Pod OS
field. In v1.25 and later, if you use the Restricted policy, the following Linux-specific restrictions will no
longer be required if you explicitly set the pod's .spec.os.name field to windows:
Seccomp - The seccompProfile.type field for Pod and container security contexts
Privilege escalation - The allowPrivilegeEscalation field on container security contexts
Capabilities - The requirement to drop ALL capabilities in the capabilities field on containers
In Kubernetes v1.23 and earlier, the kubelet didn't enforce the Pod OS field.
If your cluster includes nodes running a v1.23 or older kubelet, you should explicitly
pin Restricted policies
to a version prior to v1.25.
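Pinning is done with the enforce-version label, a sketch of which follows (the namespace name is a placeholder; the label key is the real Pod Security admission version label):

```shell
# Sketch: pin the Restricted policy to its pre-v1.25 definition for a namespace.
cat > /tmp/ns-pinned.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace          # placeholder
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.24   # pre-v1.25 policy definition
EOF
# Apply with: kubectl apply -f /tmp/ns-pinned.yaml
```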
Migrating from PodSecurityPolicy to the Pod Security admission controller
For instructions to migrate from PodSecurityPolicy to the Pod Security admission controller, and
for help choosing a migration strategy, refer to the
migration guide.
We're also developing a tool called
pspmigrator to automate parts
of the migration process.
The PodSecurityPolicy (PSP) admission controller has been removed, as of
Kubernetes v1.25. Its deprecation was announced and detailed in the blog post
PodSecurityPolicy Deprecation: Past, Present, and Future,
published for the Kubernetes v1.21 release.
This article aims to provide historical context on the birth and evolution of
PSP, explain why the feature never made it to stable, and show why it was
removed and replaced by Pod Security admission control.
PodSecurityPolicy, like other specialized admission control plugins, provided
fine-grained permissions on specific fields concerning the pod security settings
as a built-in policy API. It acknowledged that cluster administrators and
cluster users are usually not the same people, and that creating workloads in
the form of a Pod or any resource that will create a Pod should not equal being
"root on the cluster". It could also encourage best practices by configuring
more secure defaults through mutation and decoupling low-level Linux security
decisions from the deployment process.
The birth of PodSecurityPolicy
PodSecurityPolicy originated from OpenShift's SecurityContextConstraints
(SCC) that were in the very first release of the Red Hat OpenShift Container Platform,
even before Kubernetes 1.0. PSP was a stripped-down version of the SCC.
The origin of the creation of PodSecurityPolicy is difficult to track, notably
because it was mainly added before the Kubernetes Enhancement Proposal (KEP)
process existed, when design proposals were still a thing. Indeed, the archive of the final
design proposal
is still available. Nevertheless, KEP issue number five
was created after the first pull requests were merged.
Before adding the first piece of code that created PSP, two main pull
requests were merged into Kubernetes, a SecurityContext subresource
that defined new fields on pods' containers, and the first iteration of the ServiceAccount
API.
Kubernetes 1.0 was released on 10 July 2015 without any mechanism to restrict the
security context and sensitive options of workloads, other than an alpha-quality
SecurityContextDeny admission plugin (then known as scdeny).
The SecurityContextDeny plugin
is still in Kubernetes today (as an alpha feature) and creates an admission controller that
prevents the usage of some fields in the security context.
The roots of PodSecurityPolicy were laid with
the very first pull request on security policy,
which added the design proposal for the new PSP object, based on the SCC. The
discussion lasted nine months, with back and forth from OpenShift's SCC,
many rebases, and the rename to PodSecurityPolicy that finally made it into
upstream Kubernetes in February 2016. Now that the PSP object
had been created, the next step was to add an admission controller that could enforce
these policies. The first step was to add the admission logic
without taking users or groups into account.
A specific issue to bring PodSecurityPolicy to a usable state
was created to track progress, and a first version of the admission
controller was merged in a pull request named PSP admission
in May 2016. Around two months later, Kubernetes 1.3 was released.
Here is a timeline that recaps the main pull requests of the birth of the
PodSecurityPolicy and its admission controller with 1.0 and 1.3 releases as
reference points.
After that, the PSP admission controller was enhanced with what had initially been
left aside. The authorization mechanism,
merged in early November 2016, allowed administrators to use multiple policies
in a cluster to grant different levels of access to different types of users.
Later, a pull request
merged in October 2017 fixed a design issue
in PodSecurityPolicy ordering (preferring non-mutating policies before falling
back to alphabetical order), and continued to
build the PSP admission as we know it. After that, many improvements and fixes
followed to build the PodSecurityPolicy feature of recent Kubernetes releases.
The rise of Pod Security Admission
Despite the crucial issue it was trying to solve, PodSecurityPolicy presented
some major flaws:
Flawed authorization model - a user can create a pod if either the user or
the pod's service account has the use permission on a PSP that allows that
pod.
Difficult to roll out - PSPs fail closed: in the absence of a policy,
all pods are denied. This means PSP cannot be enabled by default and
that users must add PSPs for all workloads before enabling the feature,
with no audit mode to discover which pods a new policy would reject.
The opt-in model also led to insufficient test coverage and
frequent breakage due to cross-feature incompatibility. And unlike RBAC,
there was no strong culture of shipping PSP manifests with projects.
Inconsistent, unbounded API - the API grew with lots of
inconsistencies, notably because of many requests for niche use cases: e.g.
labels, scheduling, fine-grained volume controls, etc. It has poor
composability and a weak prioritization model, leading to unexpected
mutation priority. This made it really difficult to combine PSP with other
third-party admission controllers.
Requires security knowledge - effective usage still requires an
understanding of Linux security primitives, e.g. MustRunAsNonRoot +
AllowPrivilegeEscalation.
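The flawed authorization model mentioned above hinged on RBAC's use verb: a PSP only applied if the requesting user or the pod's service account was granted use on it. A minimal sketch of such a binding (policy, role, and namespace names are illustrative):

```yaml
# ClusterRole granting "use" on one specific PSP (name is illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted-user
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  resourceNames: ["restricted"]
  verbs: ["use"]
---
# Bind the role to a service account so that pods it runs
# are evaluated against the "restricted" PSP
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-restricted-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted-user
subjects:
- kind: ServiceAccount
  name: default
  namespace: example
```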
Experience with PodSecurityPolicy showed that most users only need two or three
policies, which led to the creation of the Pod Security Standards,
which define three policies:
Privileged - unrestricted policy.
Baseline - minimally restrictive policy, allowing the default pod
configuration.
Restricted - security best practice policy.
The replacement for PSP, the new Pod Security Admission,
is an in-tree admission plugin, stable as of Kubernetes v1.25, that enforces these
standards at the namespace level. It makes it easier to enforce basic pod
security without deep security knowledge. For more sophisticated use cases, you
might need a third-party solution that can easily be combined with Pod Security
Admission.
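In practice, Pod Security Admission is configured by labeling a namespace with the standard to apply in each mode; a minimal sketch (namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example-app
  labels:
    # Reject pods that violate the Baseline standard
    pod-security.kubernetes.io/enforce: baseline
    # Surface (but do not block) Restricted violations for visibility
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

The warn and audit modes provide the gradual, observable rollout path that PSP never had.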
This release includes a total of 40 enhancements. Fifteen of those enhancements are entering Alpha, ten are graduating to Beta, and thirteen are graduating to Stable. We also have two features being deprecated or removed.
Release theme and logo
Kubernetes 1.25: Combiner
The theme for Kubernetes v1.25 is Combiner.
The Kubernetes project itself is made up of many, many individual components that, when combined, take the form of the project you see today. It is also built and maintained by many individuals, all of them with different skills, experiences, histories, and interests, who join forces not just as the release team but as the many SIGs that support the project and the community year-round.
With this release we wish to honor the collaborative, open spirit that takes us from isolated developers, writers, and users spread around the globe to a combined force capable of changing the world. Kubernetes v1.25 includes a staggering 40 enhancements, none of which would exist without the incredible power we have when we work together.
Inspired by our release lead's son, Albert Song, Kubernetes v1.25 is named for each and every one of you, no matter how you choose to contribute your unique power to the combined force that becomes Kubernetes.
What's New (Major Themes)
PodSecurityPolicy is removed; Pod Security Admission graduates to Stable
PodSecurityPolicy was initially deprecated in v1.21, and with the release of v1.25, it has been removed. The updates required to improve its usability would have introduced breaking changes, so it became necessary to remove it in favor of a more friendly replacement. That replacement is Pod Security Admission, which graduates to Stable with this release. If you are currently relying on PodSecurityPolicy, please follow the instructions for migration to Pod Security Admission.
Ephemeral Containers Graduate to Stable
Ephemeral Containers are containers that exist for only a limited time within an existing pod. This is particularly useful for troubleshooting when you need to examine another container but cannot use kubectl exec because that container has crashed or its image lacks debugging utilities. Ephemeral containers graduated to Beta in Kubernetes v1.23, and with this release, the feature graduates to Stable.
Support for cgroups v2 Graduates to Stable
It has been more than two years since the Linux kernel cgroups v2 API was declared stable. With some distributions now defaulting to this API, Kubernetes must support it to continue operating on those distributions. cgroups v2 offers several improvements over cgroups v1; for more information, see the cgroups v2 documentation. While cgroups v1 will continue to be supported, this enhancement puts us in a position to be ready for its eventual deprecation and replacement.
Promoted endPort in Network Policy to GA. Network Policy providers that support the endPort field can now use it to specify a range of ports in a Network Policy. Previously, each Network Policy rule could only target a single port.
Please be aware that the endPort field must be supported by the Network Policy provider. If your provider does not support endPort and this field is specified in a Network Policy, the Network Policy will be created covering only the port field (a single port).
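The endPort field extends a single port entry into an inclusive range; a minimal sketch of a policy allowing egress to a port range (names and CIDR are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-port-range
  namespace: example
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 32000     # start of the range
      endPort: 32768  # inclusive end; requires provider support
```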
Promoted Local Ephemeral Storage Capacity Isolation to Stable
The Local Ephemeral Storage Capacity Isolation feature moved to GA. This was introduced as alpha in 1.8, moved to beta in 1.10, and is now a stable feature. It provides support for capacity isolation of local ephemeral storage between pods, such as emptyDir volumes, so that a pod can be hard-limited in its consumption of shared resources; a pod is evicted if its local ephemeral storage usage exceeds that limit.
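Capacity isolation is expressed through the ephemeral-storage resource on containers; a minimal sketch (image and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-storage-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.8
    resources:
      requests:
        ephemeral-storage: "1Gi"
      limits:
        # Exceeding this limit makes the pod subject to eviction
        ephemeral-storage: "2Gi"
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}
```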
Promoted core CSI Migration to Stable
CSI Migration is an ongoing effort that SIG Storage has been working on for a few releases. The goal is to move in-tree volume plugins to out-of-tree CSI drivers and eventually remove the in-tree volume plugins. The core CSI Migration feature moved to GA. CSI Migration for GCE PD and AWS EBS also moved to GA. CSI Migration for vSphere remains in beta (but is on by default). CSI Migration for Portworx moved to Beta (but is off by default).
Promoted CSI Ephemeral Volume to Stable
The CSI Ephemeral Volume feature allows CSI volumes to be specified directly in the pod specification for ephemeral use cases. They can be used to inject arbitrary states, such as configuration, secrets, identity, variables or similar information, directly inside pods using a mounted volume. This was initially introduced in 1.15 as an alpha feature, and it moved to GA. This feature is used by some CSI drivers such as the secret-store CSI driver.
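An ephemeral CSI volume is declared inline in the pod specification rather than through a PersistentVolumeClaim; a minimal sketch (the driver name and volume attributes are illustrative and entirely driver-specific):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: csi-ephemeral-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.8
    volumeMounts:
    - name: inline-vol
      mountPath: /data
      readOnly: true
  volumes:
  - name: inline-vol
    csi:
      driver: inline.storage.example.com  # hypothetical CSI driver name
      volumeAttributes:
        exampleAttribute: exampleValue    # interpretation is driver-specific
```

The volume's lifecycle is tied to the pod: it is created when the pod is scheduled and destroyed when the pod is deleted.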
Promoted CRD Validation Expression Language to Beta
Promoted Server Side Unknown Field Validation to Beta
Promoted the ServerSideFieldValidation feature gate to beta (on by default). This allows optionally triggering schema validation on the API server that errors when unknown fields are detected. This allows the removal of client-side validation from kubectl while maintaining the same core functionality of erroring out on requests that contain unknown or invalid fields.
Introduced KMS v2 API
Introduced the KMS v2alpha1 API to add performance, rotation, and observability improvements. Data at rest (i.e. Kubernetes Secrets) is encrypted with a DEK using AES-GCM instead of AES-CBC for KMS data encryption. No user action is required. Reads with AES-GCM and AES-CBC will continue to be allowed. See the guide Using a KMS provider for data encryption for more information.
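KMS v2 is opted into in the API server's EncryptionConfiguration by setting apiVersion: v2 on the kms provider entry; a sketch, assuming a KMS plugin listening on a local Unix socket (plugin name and socket path are illustrative):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - kms:
      apiVersion: v2              # opt in to the KMS v2alpha1 API
      name: example-kms-plugin    # illustrative plugin name
      endpoint: unix:///var/run/kms/example.sock
      timeout: 3s
  - identity: {}                  # fallback for reading still-unencrypted data
```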
Other Updates
Graduations to Stable
This release includes a total of thirteen enhancements promoted to stable:
The complete details of the Kubernetes v1.25 release are available in our release notes.
Availability
Kubernetes v1.25 is available for download on GitHub.
To get started with Kubernetes, check out these interactive tutorials or run local
Kubernetes clusters using containers as “nodes”, with kind.
You can also easily install 1.25 using kubeadm.
Release Team
Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that, when combined, make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.
We would like to thank the entire release team for the hours spent hard at work to ensure we deliver a solid Kubernetes v1.25 release for our community. Every one of you had a part to play in building this, and you all executed beautifully. We would like to extend special thanks to our fearless release lead, Cici Huang, for all she did to guarantee we had what we needed to succeed.
KubeCon + CloudNativeCon North America 2022 will take place in Detroit, Michigan from 24 – 28 October 2022! You can find more information about the conference and registration on the event site.
KubeDay event series kicks off with KubeDay Japan on December 7! Register or submit a proposal on the event site.
The CNCF K8s DevStats project
aggregates a number of interesting data points related to the velocity of Kubernetes and various
sub-projects. This includes everything from individual contributions to the number of companies that
are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.
Join members of the Kubernetes v1.25 release team on Thursday September 22, 2022 10am – 11am PT to learn about
the major features of this release, as well as deprecations and removals to help plan for upgrades.
For more information and registration, visit the event page.
Get Involved
The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests.
Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below:
Since the very beginning of Kubernetes, the topic of persistent data and how to address the requirement of stateful applications has been an important topic. Support for stateless deployments was natural, present from the start, and garnered attention, becoming very well-known. Work on better support for stateful applications was also present from early on, with each release increasing the scope of what could be run on Kubernetes.
Message queues, databases, clustered filesystems: these are some examples of the solutions that have different storage requirements and that are, today, increasingly deployed in Kubernetes. Dealing with ephemeral and persistent storage, local or remote, file or block, from many different vendors, while considering how to provide the needed resiliency and data consistency that users expect, all of this is under SIG Storage's umbrella.
In this SIG Storage spotlight, Frederico Muñoz (Cloud & Architecture Lead at SAS) talked with Xing Yang, Tech Lead at VMware and co-chair of SIG Storage, on how the SIG is organized, what are the current challenges and how anyone can get involved and contribute.
About SIG Storage
Frederico (FSM): Hello, thank you for the opportunity of learning more about SIG Storage. Could you tell us a bit about yourself, your role, and how you got involved in SIG Storage.
Xing Yang (XY): I am a Tech Lead at VMware, working on Cloud Native Storage. I am also a Co-Chair of SIG Storage. I started to get involved in K8s SIG Storage at the end of 2017, starting with contributing to the VolumeSnapshot project. At that time, the VolumeSnapshot project was still in an experimental, pre-alpha stage. It needed contributors. So I volunteered to help. Then I worked with other community members to bring VolumeSnapshot to Alpha in K8s 1.12 release in 2018, Beta in K8s 1.17 in 2019, and eventually GA in 1.20 in 2020.
FSM: Reading the SIG Storage charter alone it’s clear that SIG Storage covers a lot of ground, could you describe how the SIG is organised?
XY: In SIG Storage, there are two Co-Chairs and two Tech Leads. Saad Ali from Google and myself are Co-Chairs. Michelle Au from Google and Jan Šafránek from Red Hat are Tech Leads.
We have bi-weekly meetings where we go through features we are working on for each particular release, getting the statuses, making sure each feature has dev owners and reviewers working on it, and reminding people about the release deadlines, etc. More information on the SIG is on the community page. People can also add PRs that need attention, design proposals that need discussion, and other topics to the meeting agenda doc. We will go over them after project tracking is done.
We also have other regular meetings, i.e., CSI Implementation meeting, Object Bucket API design meeting, and one-off meetings for specific topics if needed. There is also a K8s Data Protection Workgroup that is sponsored by SIG Storage and SIG Apps. SIG Storage owns or co-owns features that are being discussed at the Data Protection WG.
Storage and Kubernetes
FSM: Storage is such a foundational component in so many things, not least in Kubernetes: what do you think are the Kubernetes-specific challenges in terms of storage management?
XY: In Kubernetes, there are multiple components involved for a volume operation. For example, creating a Pod to use a PVC has multiple components involved. There are the Attach Detach Controller and the external-attacher working on attaching the PVC to the pod. There’s the Kubelet that works on mounting the PVC to the pod. Of course the CSI driver is involved as well. There could be race conditions sometimes when coordinating between multiple components.
Another challenge is regarding core vs Custom Resource Definitions (CRD), not really storage specific. CRD is a great way to extend Kubernetes capabilities while not adding too much code to the Kubernetes core itself. However, this also means there are many external components that are needed when running a Kubernetes cluster.
From the SIG Storage side, the most notable example is Volume Snapshot. Volume Snapshot APIs are defined as CRDs. API definitions and controllers are out-of-tree. There is a common snapshot controller and a snapshot validation webhook that should be deployed on the control plane, similar to how kube-controller-manager is deployed. Although Volume Snapshot is a CRD, it is a core feature of SIG Storage. It is recommended that K8s cluster distros deploy the Volume Snapshot CRDs, the snapshot controller, and the snapshot validation webhook; however, most of the time we don't see distros deploy them. So this becomes a problem for the storage vendors: now it becomes their responsibility to deploy these non-driver-specific common components. This could cause conflicts if a customer wants to use more than one storage system and deploy more than one CSI driver.
FSM: Not only the complexity of a single storage system, you have to consider how they will be used together in Kubernetes?
XY: Yes, there are many different storage systems that can provide storage to containers in Kubernetes. They don’t work the same way. It is challenging to find a solution that works for everyone.
FSM: Storage in Kubernetes also involves interacting with external solutions, perhaps more so than other parts of Kubernetes. Is this interaction with vendors and external providers challenging? Has it evolved with time in any way?
XY: Yes, it is definitely challenging. Initially Kubernetes storage had in-tree volume plugin interfaces. Multiple storage vendors implemented these in-tree interfaces and have volume plugins in the Kubernetes core code base. This caused lots of problems. If there is a bug in a volume plugin, it affects the entire Kubernetes code base. All volume plugins must be released together with Kubernetes. There was no flexibility if storage vendors needed to fix a bug in their plugin or wanted to align with their own product releases.
FSM: That’s where CSI enters the game?
XY: Exactly, that's where the Container Storage Interface (CSI) comes in. This is an industry standard that defines common storage interfaces so that a storage vendor can write one plugin and have it work across a range of container orchestration systems (COs). Now Kubernetes is the main CO, but when CSI first started, there were Docker, Mesos, and Cloud Foundry in addition to Kubernetes. CSI drivers are out-of-tree, so bug fixes and releases can happen at their own pace.
CSI is definitely a big improvement compared to in-tree volume plugins. The Kubernetes implementation of CSI has been GA since the 1.13 release. It has come a long way. SIG Storage has been working on moving in-tree volume plugins to out-of-tree CSI drivers for several releases now.
FSM: Moving drivers away from the Kubernetes main tree and into CSI was an important improvement.
XY: The CSI interface is an improvement over the in-tree volume plugin interface; however, there are still challenges. There are lots of storage systems. Currently there are more than 100 CSI drivers listed in the CSI driver docs. These storage systems are also very diverse, so it is difficult to design a common API that works for all. We introduced capabilities at the CSI driver level, but we also face challenges when volumes provisioned by the same driver have different behaviors. The other day we had a meeting discussing Per Volume CSI Driver Capabilities: we have a problem differentiating some CSI driver capabilities when the same driver supports both block and file volumes. We are going to have follow-up meetings to discuss this problem.
Ongoing challenges
FSM: Specifically for the 1.25 release we can see that there are a relevant number of storage-related KEPs in the pipeline, would you say that this release is particularly important for the SIG?
XY: I wouldn’t say one release is more important than other releases. In any given release, we are working on a few very important things.
FSM: Indeed, but are there any 1.25-specific highlights you would like to point out?
XY: Yes. For the 1.25 release, I want to highlight the following:
CSI Migration is an ongoing effort that SIG Storage has been working on for a few releases now. The goal is to move in-tree volume plugins to out-of-tree CSI drivers and eventually remove the in-tree volume plugins. Seven of the KEPs we are targeting in 1.25 are related to CSI migration. There is one core KEP for the general CSI Migration feature, which is targeting GA in 1.25. CSI Migration for GCE PD and AWS EBS are targeting GA. CSI Migration for vSphere is targeting having its feature gate on by default while remaining in Beta in 1.25. Ceph RBD and Portworx are targeting Beta, with the feature gate off by default. Ceph FS is targeting Alpha.
The second one I want to highlight is COSI, the Container Object Storage Interface. This is a sub-project under SIG Storage. COSI proposes object storage Kubernetes APIs to support orchestration of object store operations for Kubernetes workloads. It also introduces gRPC interfaces for object storage providers to write drivers to provision buckets. The COSI team has been working on this project for more than two years now. The COSI feature is targeting Alpha in 1.25. The KEP just got merged. The COSI team is working on updating the implementation based on the updated KEP.
Another feature I want to mention is CSI Ephemeral Volume support. This feature allows CSI volumes to be specified directly in the pod specification for ephemeral use cases. They can be used to inject arbitrary states, such as configuration, secrets, identity, variables or similar information, directly inside pods using a mounted volume. This was initially introduced in 1.15 as an alpha feature, and it is now targeting GA in 1.25.
FSM: If you had to single something out, what would be the most pressing areas the SIG is working on?
XY: CSI migration is definitely one area that the SIG has put in lots of effort and it has been on-going for multiple releases now. It involves work from multiple cloud providers and storage vendors as well.
Community involvement
FSM: Kubernetes is a community-driven project. Any recommendation for anyone looking into getting involved in SIG Storage work? Where should they start?
XY: Take a look at the SIG Storage community page, it has lots of information on how to get started. There are SIG annual reports that tell you what we did each year. Take a look at the Contributing guide. It has links to presentations that can help you get familiar with Kubernetes storage concepts.
Join our bi-weekly meetings on Thursdays. Learn how the SIG operates and what we are working on for each release. Find a project that you are interested in and help out. As I mentioned earlier, I got started in SIG Storage by contributing to the Volume Snapshot project.
FSM: Any closing thoughts you would like to add?
XY: SIG Storage always welcomes new contributors. We need contributors to help with building new features, fixing bugs, doing code reviews, writing tests, monitoring test grid health, improving documentation, and more.
FSM: Thank you so much for your time and insights into the workings of SIG Storage!