Craig Ingram has graciously attempted over the years to keep track of the
status of the findings reported in the last audit in this issue:
kubernetes/kubernetes#81146.
This blog post dives deeper into that work, addresses any gaps
in tracking, and serves as a point-in-time summary of the state of the
findings reported in 2019.
This article should also help readers gain confidence, through transparent
communication, in the work done by the community to address these findings, and
surface any findings that still need help from community contributors.
Current State
The status of each issue / finding here is represented on a best-effort basis.
The authors do not claim to be 100% accurate on the status and welcome any
corrections or feedback, via a comment directly on the relevant issue, if the
current state is not reflected accurately.
Apart from fixes to the specific issues, the 2019 third-party security audit
also motivated security-focused enhancements in the next few releases of
Kubernetes. One such example is
Kubernetes. One such example is
Kubernetes Enhancement Proposal (KEP) 1933 Defend Against Logging Secrets via Static Analysis to prevent exposing
secrets to logs with Patrick Rhomberg driving the
implementation. As a result of this KEP,
go-flow-levee, a taint propagation
analysis tool configured to detect logging of secrets, is executed in a
script
as a Prow presubmit job. This KEP was introduced in v1.20.0 as an alpha
feature, then graduated to beta in v1.21.0, and graduated to stable in
v1.23.0. As stable, the analysis runs as a blocking presubmit test. This
KEP also helped resolve the following issues from the 2019 third party security audit:
Many of the 37 findings identified were fixed by work from
our community members over the last three years. However, we still have some work
left to do. Here's a breakdown of the remaining work, with rough estimates of the
time commitment, complexity, and benefit to the ecosystem of fixing
these pending issues.
Note: Anything requiring a KEP (Kubernetes Enhancement Proposal) is considered
high time commitment and high complexity. Benefit to Ecosystem is
roughly equivalent to the risk of leaving the finding unfixed, which is
determined by the severity level plus the likelihood of a successful vulnerability
exploit. These estimates and the values in the table below are the authors'
personal opinion. An individual's or end user's threat model may rate the
benefit of fixing a particular issue higher or lower.
| Title | Issue | Time Commitment | Complexity | Benefit to Ecosystem |
| --- | --- | --- | --- | --- |
| Kubernetes does not facilitate certificate revocation | | | | |
To get started on fixing any of these findings that need help, please
consider getting involved in Kubernetes SIG
Security
by joining our bi-weekly meetings or hanging out with us on our Slack
Channel.
Authors: Abdullah Gharaibeh (Google), Aldo Culquicondor (Google)
Whether on-premises or in the cloud, clusters face real constraints on resource usage, quota, and cost. Regardless of autoscaling capabilities, clusters have finite capacity. As a result, users want an easy way to fairly and
efficiently share resources.
In this article, we introduce Kueue,
an open source job queueing controller designed to manage batch jobs as a single unit.
Kueue leaves pod-level orchestration to existing stable components of Kubernetes.
Kueue natively supports the Kubernetes Job
API and offers hooks for integrating other custom-built APIs for batch jobs.
Why Kueue?
Job queueing is a key feature to run batch workloads at scale in both on-premises and cloud environments. The main goal
of job queueing is to manage access to a limited pool of resources shared by multiple tenants. Job queueing decides which
jobs should wait, which can start immediately, and what resources they can use.
Some of the most desired job queueing requirements include:
Quota and budgeting to control who can use what and up to what limit. This is not only needed in clusters with static resources like on-premises,
but it is also needed in cloud environments to control spend or usage of scarce resources.
Fair sharing of resources between tenants. To maximize the usage of available resources, any unused quota assigned to inactive tenants should be
allowed to be shared fairly between active tenants.
Flexible placement of jobs across different resource types based on availability. This is important in cloud environments which have heterogeneous
resources such as different architectures (GPU or CPU models) and different provisioning modes (spot vs on-demand).
Support for autoscaled environments where resources can be provisioned on demand.
Plain Kubernetes doesn't address the above requirements. In normal circumstances, once a Job is created, the job-controller instantly creates the
pods and kube-scheduler continuously attempts to assign the pods to nodes. At scale, this situation can work the control plane to death. There is
also currently no good way to control at the job level which jobs should get which resources first, and no way to express order or fair sharing. The
current ResourceQuota model is not a good fit for these needs because quotas are enforced on resource creation, and there is no queueing of requests. The
intent of ResourceQuotas is to provide a builtin reliability mechanism with policies needed by admins to protect clusters from failing over.
In the Kubernetes ecosystem, there are several solutions for job scheduling. However, we found that these alternatives have one or more of the following problems:
They replace existing stable components of Kubernetes, like kube-scheduler or the job-controller. This is problematic not only from an operational point of view, but
also the duplication in the job APIs causes fragmentation of the ecosystem and reduces portability.
They don't integrate with autoscaling, or
They lack support for resource flexibility.
How Kueue works
With Kueue we decided to take a different approach to job queueing on Kubernetes that is anchored around the following aspects:
Not duplicating existing functionalities already offered by established Kubernetes components for pod scheduling, autoscaling and job
lifecycle management.
Adding key features that are missing to existing components. For example, we invested in the Job API to cover more use cases like
IndexedJob and fixed long standing issues related to pod
tracking. While this path takes longer to
land features, we believe it is the more sustainable long term solution.
Ensuring compatibility with cloud environments where compute resources are elastic and heterogeneous.
For this approach to be feasible, Kueue needs knobs to influence the behavior of those established components so it can effectively manage
when and where to start a job. We added those knobs to the Job API in the form of two features:
Suspend field, which allows Kueue to signal to the job-controller
when to start or stop a Job.
Mutable scheduling directives, which allows Kueue to
update a Job's .spec.template.spec.nodeSelector before starting the Job. This way, Kueue can control Pod placement while still
delegating to kube-scheduler the actual pod-to-node scheduling.
Note that any custom job API can be managed by Kueue if that API offers the above two capabilities.
Resource model
Kueue defines new APIs to address the requirements mentioned at the beginning of this post. The three main APIs are:
ResourceFlavor: a cluster-scoped API to define a resource flavor available for consumption, such as a GPU model. At its core, a ResourceFlavor is
a set of labels that mirrors the labels on the nodes that offer those resources.
ClusterQueue: a cluster-scoped API to define resource pools by setting quotas for one or more ResourceFlavor.
LocalQueue: a namespaced API for grouping and managing single tenant jobs. In its simplest form, a LocalQueue is a pointer to the ClusterQueue
that the tenant (modeled as a namespace) can use to start their jobs.
For more details, take a look at the API concepts documentation. While the three APIs may look overwhelming,
most of Kueue’s operations are centered around ClusterQueue; the ResourceFlavor and LocalQueue APIs are mainly organizational wrappers.
Example use case
Imagine the following setup for running batch workloads on a Kubernetes cluster on the cloud:
You have cluster-autoscaler installed in the cluster to automatically
adjust the size of your cluster.
There are two types of autoscaled node groups that differ on their provisioning policies: spot and on-demand. The nodes of each group are
differentiated by the label instance-type=spot or instance-type=ondemand.
Moreover, since not all Jobs can tolerate running on spot nodes, the nodes are tainted with spot=true:NoSchedule.
To strike a balance between cost and resource availability, imagine you want Jobs to use up to 1000 cores of on-demand nodes, then use up to
2000 cores of spot nodes.
As an admin for the batch system, you define two ResourceFlavors that represent the two types of nodes:
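The manifests for this setup might look like the following. This is a minimal sketch based on the Kueue API at the time of writing (v1alpha1); names such as research-pool are placeholders, and the exact schema may differ in newer Kueue releases, so check the Kueue documentation:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: ResourceFlavor
metadata:
  name: ondemand
labels:
  instance-type: ondemand
---
apiVersion: kueue.x-k8s.io/v1alpha1
kind: ResourceFlavor
metadata:
  name: spot
labels:
  instance-type: spot
taints:
- key: spot
  value: "true"
  effect: NoSchedule
---
# A ClusterQueue granting up to 1000 on-demand cores and 2000 spot cores
apiVersion: kueue.x-k8s.io/v1alpha1
kind: ClusterQueue
metadata:
  name: research-pool
spec:
  namespaceSelector: {}
  resources:
  - name: "cpu"
    flavors:
    - name: ondemand
      quota:
        min: 1000
    - name: spot
      quota:
        min: 2000
```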
Note that the order of flavors in the ClusterQueue resources matters: Kueue will attempt to fit jobs in the available quotas according to
the order unless the job has an explicit affinity to specific flavors.
For each namespace, you define a LocalQueue that points to the ClusterQueue above:
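A LocalQueue for a tenant namespace could be sketched as follows (the namespace and names are illustrative, and the API version should match your installed Kueue release):

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: LocalQueue
metadata:
  namespace: team-a          # the tenant's namespace
  name: main-queue
spec:
  clusterQueue: research-pool   # points at the cluster-scoped queue defined by the admin
```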
Admins create the above setup once. Batch users are able to find the queues they are allowed to
submit to by listing the LocalQueues in their namespace(s). The command is similar to the following: kubectl get -n my-namespace localqueues
To submit work, create a Job and set the kueue.x-k8s.io/queue-name annotation as follows:
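A sketch of such a Job follows; the image, command, and resource requests are illustrative, and the queue name must match a LocalQueue in the Job's namespace:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  annotations:
    kueue.x-k8s.io/queue-name: main-queue   # the LocalQueue to submit to
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      tolerations:             # this Job tolerates running on spot nodes
      - key: spot
        operator: Exists
        effect: NoSchedule
      containers:
      - name: main
        image: busybox
        command: ["sleep", "30"]
        resources:
          requests:
            cpu: 1
      restartPolicy: Never
```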
Kueue intervenes to suspend the Job as soon as it is created. Once the Job is at the head of the ClusterQueue, Kueue evaluates if it can start
by checking if the resources requested by the job fit the available quota.
In the above example, the Job tolerates spot resources. If there are previously admitted Jobs consuming all existing on-demand quota but
not all of spot’s, Kueue admits the Job using the spot quota. Kueue does this by issuing a single update to the Job object that:
Changes the .spec.suspend flag to false
Adds the term instance-type: spot to the job's .spec.template.spec.nodeSelector so that when the pods are created by the job controller, those pods can only schedule
onto spot nodes.
Finally, if there are available empty nodes with matching node selector terms, then kube-scheduler will directly schedule the pods. If not, then
kube-scheduler will initially mark the pods as unschedulable, which will trigger the cluster-autoscaler to provision new nodes.
Future work and getting involved
The example above offers a glimpse of some of Kueue's features including support for quota, resource flexibility, and integration with cluster
autoscaler. Kueue also supports fair-sharing, job priorities, and different queueing strategies. Take a look at the
Kueue documentation to learn more about those features and how to use Kueue.
We have a number of features that we plan to add to Kueue, such as hierarchical quota, budgets, and support for dynamically sized jobs. In
the more immediate future, we are focused on adding support for job preemption.
The latest Kueue release is available on Github;
try it out if you run batch workloads on Kubernetes (requires v1.22 or newer).
We are in the early stages of this project and we are seeking feedback of all levels, major or minor, so please don’t hesitate to reach out. We’re
also open to additional contributors, whether it is to fix or report bugs, or help add new features or write documentation. You can get in touch with
us via our repo, mailing list or on
Slack.
Last but not least, thanks to all our contributors who made this project possible!
Authors: Rodrigo Campos (Microsoft), Giuseppe Scrivano (Red Hat)
Kubernetes v1.25 introduces the support for user namespaces.
This is a major improvement for running secure workloads in
Kubernetes. Each pod will have access only to a limited subset of the
available UIDs and GIDs on the system, thus adding a new security
layer to protect from other pods running on the same system.
How does it work?
A process running on Linux can use up to 4294967296 different UIDs and
GIDs.
User namespaces are a Linux feature that allows mapping a set of users
in the container to different users on the host, thus restricting which
IDs a process can effectively use.
Furthermore, the capabilities granted in a new user namespace do not
apply in the host's initial namespaces.
Why is it important?
There are two main reasons why user namespaces are important:
improved security, since they restrict the IDs a pod can use, so each
pod can run in its own separate environment with unique IDs.
the ability to run workloads as root in a safer manner.
In a user namespace we can map the root user inside the pod to a
non-zero ID outside the container: the container believes it is running as
root, while from the host's point of view it runs as a regular
unprivileged ID.
The process can keep capabilities that are usually restricted to
privileged pods and do so in a safe way, since the capabilities granted
in a new user namespace do not apply in the host's initial namespaces.
How do I enable user namespaces?
At the moment, user namespaces support is opt-in, so you must enable
it for each pod by setting hostUsers to false in the pod's spec:
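For example, a minimal Pod opting in to user namespaces might look like this (the name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo
spec:
  hostUsers: false        # run this pod's containers in a new user namespace
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
```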
Immutable fields can be found in a few places in the built-in Kubernetes types.
For example, you can't change the .metadata.name of an object. Specific objects
have fields where changes to existing objects are constrained; for example, the
.spec.selector of a Deployment.
Aside from simple immutability, there are other common design patterns such as
lists which are append-only, or a map with mutable values and immutable keys.
Until recently the best way to restrict field mutability for CustomResourceDefinitions
has been to create a validating
admission webhook:
this means a lot of complexity for the common case of making a field immutable.
Beta since Kubernetes 1.25, CEL Validation Rules allow CRD authors to express
validation constraints on their fields using a rich expression language,
CEL. This article explores how you can
use validation rules to implement a few common immutability patterns directly in
the manifest for a CRD.
Basics of validation rules
The new support for CEL validation rules in Kubernetes allows CRD authors to add
complicated admission logic for their resources without writing any code!
For example, A CEL rule to constrain a field maximumSize to be greater than a
minimumSize for a CRD might look like the following:
rule: |
  self.maximumSize > self.minimumSize
message: 'Maximum size must be greater than minimum size.'
The rule field contains an expression written in CEL. self is a special keyword
in CEL which refers to the object whose type contains the rule.
The message field is an error message which will be sent to Kubernetes clients
whenever this particular rule is not satisfied.
For more details about the capabilities and limitations of Validation Rules using
CEL, please refer to
validation rules.
The CEL specification is also a good
reference for information specifically related to the language.
Immutability patterns with CEL validation rules
This section implements several common use cases for immutability in Kubernetes
CustomResourceDefinitions, using validation rules expressed as
kubebuilder marker comments.
The resultant OpenAPI generated by the kubebuilder marker comments will also be
included, so that if you are writing your CRD manifests by hand you can still
follow along.
Project setup
To use CEL rules with kubebuilder comments, you first need to set up a Golang
project structure with the CRD defined in Go.
You may skip this step if you are not using kubebuilder or are only interested
in the resultant OpenAPI extensions.
Begin with a folder structure of a Go module set up like the following. If
you have your own project already set up feel free to adapt this tutorial to your liking:
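Assuming the module is named cel-immutability-tutorial and the API group is stable.example.com/v1 (as used throughout this tutorial), the starting layout might look like:

```
cel-immutability-tutorial
├── generate.go
├── go.mod
├── pkg
│   └── apis
│       └── stable.example.com
│           └── v1
│               └── types.go
└── tools.go
```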
types.go contains all type definitions in stable.example.com/v1
package v1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// An empty CRD as an example of defining a type using controller tools
// +kubebuilder:storageversion
// +kubebuilder:subresource:status
type TestCRD struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec TestCRDSpec `json:"spec,omitempty"`
Status TestCRDStatus `json:"status,omitempty"`
}
type TestCRDStatus struct {}
type TestCRDSpec struct {
// You will fill this in as you go along
}
tools.go contains a dependency on controller-gen which will be used to generate the CRD definition:
//go:build tools
package celimmutabilitytutorial
// Force a direct dependency on controller-gen so that it may be executed with go run
import (
_ "sigs.k8s.io/controller-tools/cmd/controller-gen"
)
Finally, generate.go contains a go:generate directive that invokes
controller-gen. controller-gen parses our types.go and generates
CRD YAML files into a crds folder:
package celimmutabilitytutorial
//go:generate go run sigs.k8s.io/controller-tools/cmd/controller-gen crd paths=./pkg/apis/... output:dir=./crds
You may now want to add dependencies for our definitions and test the code generation:
cd cel-immutability-tutorial
go mod init <your-org>/<your-module-name>
go mod tidy
go generate ./...
After running these commands you now have completed the basic project structure.
Your folder tree should look like the following:
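Assuming the same module layout as above, the tree after generation might look like:

```
cel-immutability-tutorial
├── crds
│   └── stable.example.com_testcrds.yaml
├── generate.go
├── go.mod
├── go.sum
├── pkg
│   └── apis
│       └── stable.example.com
│           └── v1
│               └── types.go
└── tools.go
```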
The manifest for the example CRD is now available in crds/stable.example.com_testcrds.yaml.
Immutability after first modification
A common immutability design pattern is to make the field immutable once it has
first been set. This example will produce a validation error if the field
changes after being first initialized.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type ImmutableSinceFirstWrite struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
// +kubebuilder:validation:MaxLength=512
Value string `json:"value"`
}
The +kubebuilder directives in the comments inform controller-gen how to
annotate the generated OpenAPI. The XValidation rule causes the rule to appear
among the x-kubernetes-validations OpenAPI extension. Kubernetes then
respects the OpenAPI spec to enforce our constraints.
To enforce a field's immutability after its first write, you need to apply the following constraints:
The field must be allowed to be initially unset: +kubebuilder:validation:Optional
Once set, the field must not be allowed to be removed: !has(oldSelf.value) || has(self.value) (type-scoped rule)
Once set, the field must not be allowed to change value: self == oldSelf (field-scoped rule)
Also note the additional directive +kubebuilder:validation:MaxLength. CEL
requires that all strings have a maximum length attached so that it can estimate the
computation cost of the rule. Rules that are too expensive are rejected.
For more information on CEL cost budgeting, check out the other tutorial.
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincefirstwrites.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincefirstwrites.stable.example.com created
Creating an initial empty object with no value is permitted, since value is optional. Once value has been set, however, an attempt to remove it is rejected by the type-scoped rule:
The ImmutableSinceFirstWrite "test1" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
Note that in the generated schema there are two separate rule locations.
One is attached directly to the value property.
The other rule is associated with the CRD type itself.
openAPIV3Schema:
properties:
value:
maxLength: 512
type: string
x-kubernetes-validations:
- message: Value is immutable
rule: self == oldSelf
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.value) || has(self.value)'
Immutability upon object creation
A field which is immutable upon creation time is implemented similarly to the
earlier example. The difference is that that field is marked required, and the
type-scoped rule is no longer necessary.
type ImmutableSinceCreation struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Required
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
// +kubebuilder:validation:MaxLength=512
Value string `json:"value"`
}
This field will be required when the object is created, and after that point it will
not be allowed to be modified; our CEL validation rule self == oldSelf enforces this.
Usage example
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincecreations.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincecreations.stable.example.com created
Applying an object without the required field should fail:
The ImmutableSinceCreation "test1" is invalid:
* value: Required value
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation
Now that the field has been added, the operation is permitted:
immutablesincecreation.stable.example.com/test1 created
If you attempt to change the value, the operation is blocked due to the
validation rules in the CRD. Note that the error message is as it was defined
in the validation rule.
The ImmutableSinceCreation "test1" is invalid: value: Invalid value: "string": Value is immutable
Generated schema
openAPIV3Schema:
properties:
value:
maxLength: 512
type: string
x-kubernetes-validations:
- message: Value is immutable
rule: self == oldSelf
required:
- value
type: object
Append-only list of containers
In the case of ephemeral containers on Pods, Kubernetes enforces that the
elements in the list are immutable, and can’t be removed. The following example
shows how you could use CEL to achieve the same behavior.
// Note: this type requires importing v1 "k8s.io/api/core/v1" for the EphemeralContainer type.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type AppendOnlyList struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:MaxItems=100
// +kubebuilder:validation:XValidation:rule="oldSelf.all(x, x in self)",message="Values may only be added"
Values []v1.EphemeralContainer `json:"value"`
}
Once set, field must not be deleted: !has(oldSelf.value) || has(self.value) (type-scoped)
Once a value is added it is not removed: oldSelf.all(x, x in self) (field-scoped)
Value may be initially unset: +kubebuilder:validation:Optional
Note that for cost-budgeting purposes, MaxItems is also required to be specified.
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_appendonlylists.yaml
customresourcedefinition.apiextensions.k8s.io/appendonlylists.stable.example.com created
Creating an initial list with one element inside should succeed without problem. Once the list has been set, however, removing the entire value field is rejected by the type-scoped rule:
The AppendOnlyList "testlist" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
openAPIV3Schema:
properties:
value:
items: ...
maxItems: 100
type: array
x-kubernetes-validations:
- message: Values may only be added
rule: oldSelf.all(x, x in self)
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.value) || has(self.value)'
Map with append-only keys, immutable values
// A map which does not allow keys to be removed or their values changed once set. New keys may be added, however.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.values) || has(self.values)", message="Value is required once set"
type MapAppendOnlyKeys struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:validation:MaxProperties=10
// +kubebuilder:validation:XValidation:rule="oldSelf.all(key, key in self && self[key] == oldSelf[key])",message="Keys may not be removed and their values must stay the same"
Values map[string]string `json:"values,omitempty"`
}
Once set, field must not be deleted: !has(oldSelf.values) || has(self.values) (type-scoped)
Once a key is added it is not removed nor is its value modified: oldSelf.all(key, key in self && self[key] == oldSelf[key]) (field-scoped)
Value may be initially unset: +kubebuilder:validation:Optional
Example usage
Generating and installing the CRD should succeed:
# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_mapappendonlykeys.yaml
customresourcedefinition.apiextensions.k8s.io/mapappendonlykeys.stable.example.com created
Creating an initial object with one key within values is permitted. Changing the value of an existing key (or removing a key), however, is rejected by the field-scoped rule:
The MapAppendOnlyKeys "testmap" is invalid: values: Invalid value: "object": Keys may not be removed and their values must stay the same
If the entire field is removed, the other validation rule is triggered and the
operation is prevented. Note that the error message for the validation rule is
shown to the user.
The MapAppendOnlyKeys "testmap" is invalid: <nil>: Invalid value: "object": Value is required once set
Generated schema
openAPIV3Schema:
description: A map which does not allow keys to be removed or their values
changed once set. New keys may be added, however.
properties:
values:
additionalProperties:
type: string
maxProperties: 10
type: object
x-kubernetes-validations:
- message: Keys may not be removed and their values must stay the same
rule: oldSelf.all(key, key in self && self[key] == oldSelf[key])
type: object
x-kubernetes-validations:
- message: Value is required once set
rule: '!has(oldSelf.values) || has(self.values)'
Going further
The above examples showed how CEL rules can be added to kubebuilder types.
The same rules can be added directly to OpenAPI if writing a manifest for a CRD by hand.
For native types, the same behavior can be achieved using kube-openapi’s marker
+validations.
Usage of CEL within Kubernetes validation rules is far more powerful than
what has been shown in this article. For more information please check out
validation rules
in the Kubernetes documentation and the CRD Validation Rules Beta blog post.
The Kubernetes in-tree storage plugin to Container Storage Interface (CSI) migration infrastructure has been beta since v1.17; CSI migration was introduced as alpha in Kubernetes v1.14.
Since then, SIG Storage and other Kubernetes special interest groups have been working to ensure feature stability and compatibility in preparation for the CSI Migration feature to go GA.
SIG Storage is excited to announce that the core CSI Migration feature is generally available in the Kubernetes v1.25 release!
SIG Storage wrote a blog post in v1.23 about the CSI migration status of each storage driver. It has been a while, and this article gives the latest status update on each storage driver's CSI migration in Kubernetes v1.25.
Quick recap: What is CSI Migration, and why migrate?
The Container Storage Interface (CSI) was designed to help Kubernetes replace its existing, in-tree storage driver mechanisms - especially vendor specific plugins.
Kubernetes support for the Container Storage Interface has been
generally available since Kubernetes v1.13.
Support for using CSI drivers was introduced to make it easier to add and maintain new integrations between Kubernetes and storage backend technologies. Using CSI drivers allows for better maintainability (driver authors can define their own release cycle and support lifecycle) and reduces the opportunity for vulnerabilities (with less in-tree code, the risk of a mistake is reduced, and cluster operators can select only the storage drivers that their cluster requires).
As more CSI Drivers were created and became production ready, SIG Storage wanted all Kubernetes users to benefit from the CSI model. However, we could not break API compatibility with the existing storage API types due to k8s architecture conventions. The solution we came up with was CSI migration: a feature that translates in-tree APIs to equivalent CSI APIs and delegates operations to a replacement CSI driver.
The CSI migration effort enables the replacement of existing in-tree storage plugins such as kubernetes.io/gce-pd or kubernetes.io/aws-ebs with a corresponding CSI driver from the storage backend.
If CSI Migration is working properly, Kubernetes end users shouldn’t notice a difference. Existing StorageClass, PersistentVolume and PersistentVolumeClaim objects should continue to work.
When a Kubernetes cluster administrator updates a cluster to enable CSI migration, existing workloads that utilize PVCs which are backed by in-tree storage plugins will continue to function as they always have.
However, behind the scenes, Kubernetes hands control of all storage management operations (previously targeting in-tree drivers) to CSI drivers.
For example, suppose you are a kubernetes.io/gce-pd user; after CSI migration, you can still use kubernetes.io/gce-pd to provision new volumes, mount existing GCE-PD volumes or delete existing volumes. All existing APIs and Interface will still function correctly. However, the underlying function calls are all going through the GCE PD CSI driver instead of the in-tree Kubernetes function.
This enables a smooth transition for end users. Additionally as storage plugin developers, we can reduce the burden of maintaining the in-tree storage plugins and eventually remove them from the core Kubernetes binary.
What is the timeline / status?
The current and targeted releases for each individual driver are shown in the table below:
| Driver | Alpha | Beta (in-tree deprecated) | Beta (on-by-default) | GA | Target "in-tree plugin" removal |
| --- | --- | --- | --- | --- | --- |
| AWS EBS | 1.14 | 1.17 | 1.23 | 1.25 | 1.27 (Target) |
| Azure Disk | 1.15 | 1.19 | 1.23 | 1.24 | 1.26 (Target) |
| Azure File | 1.15 | 1.21 | 1.24 | 1.26 (Target) | 1.28 (Target) |
| Ceph FS | 1.26 (Target) | | | | |
| Ceph RBD | 1.23 | 1.26 (Target) | 1.27 (Target) | 1.28 (Target) | 1.30 (Target) |
| GCE PD | 1.14 | 1.17 | 1.23 | 1.25 | 1.27 (Target) |
| OpenStack Cinder | 1.14 | 1.18 | 1.21 | 1.24 | 1.26 (Target) |
| Portworx | 1.23 | 1.25 | 1.26 (Target) | 1.27 (Target) | 1.29 (Target) |
| vSphere | 1.18 | 1.19 | 1.25 | 1.26 (Target) | 1.28 (Target) |
The following storage drivers will not have CSI migration support.
The scaleio, flocker, quobyte and storageos drivers were removed; the others are deprecated and will be removed from core Kubernetes in the coming releases.
| Driver | Deprecated | Code Removal |
| --- | --- | --- |
| Flocker | 1.22 | 1.25 |
| GlusterFS | 1.25 | 1.26 (Target) |
| Quobyte | 1.22 | 1.25 |
| ScaleIO | 1.16 | 1.22 |
| StorageOS | 1.22 | 1.25 |
What does it mean for the core CSI Migration feature to go GA?
The core CSI Migration feature going GA means that the general framework, core library, and API for CSI migration are
stable in Kubernetes v1.25 and will be part of future Kubernetes releases as well.
If you are a Kubernetes distribution maintainer, this means you can no longer disable the CSIMigration feature gate: it has been locked to enabled.
If you are a Kubernetes storage driver developer, this means you can expect no backwards-incompatible changes in the CSI migration library.
If you are a Kubernetes maintainer, expect no changes to your day-to-day development flows.
If you are a Kubernetes user, expect nothing to change from your day-to-day usage flows. If you encounter any storage related issues, contact the people who operate your cluster (if that's you, contact the provider of your Kubernetes distribution, or get help from the community).
What does it mean for the storage driver CSI migration to go GA?
Storage driver CSI migration going GA means that the specific storage driver supports CSI migration, with feature parity between the in-tree plugin and the CSI driver.
If you are a Kubernetes distribution maintainer, make sure you install the corresponding
CSI driver on the distribution. Also make sure you are not disabling the driver-specific CSIMigration{provider} feature gates, as they are locked.
If you are a Kubernetes storage driver maintainer, make sure the CSI driver provides feature parity with the in-tree plugin if it supports CSI migration.
If you are a Kubernetes maintainer/developer, expect nothing to change in your day-to-day development flows.
If you are a Kubernetes user, the CSI migration feature should be completely transparent
to you; the only requirement is to install the corresponding CSI driver.
What's next?
We expect removal of the in-tree cloud provider storage plugin code to begin as part of the v1.26 and v1.27 releases of Kubernetes. More and more drivers that support CSI migration will go GA in the upcoming releases.
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. We offer a huge thank you to the contributors who stepped up these last quarters to help move the project forward:
Xing Yang (xing-yang)
Hemant Kumar (gnufied)
Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CSI migration feature:
Andy Zhang (andyzhangz)
Divyen Patel (divyenpatel)
Deep Debroy (ddebroy)
Humble Devassy Chirammal (humblec)
Ismail Alidzhikov (ialidzhikov)
Jordan Liggitt (liggitt)
Matthew Cary (mattcary)
Matthew Wong (wongma7)
Neha Arora (nearora-msft)
Oksana Naumov (trierra)
Saad Ali (saad-ali)
Michelle Au (msau42)
Those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system should join the Kubernetes Storage Special Interest Group (SIG). We're rapidly growing and always welcome new contributors.
Validation rules make it possible to declare how custom resources are validated using the Common Expression Language (CEL). For example:
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
...
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          x-kubernetes-validations:
            - rule: "self.minReplicas <= self.replicas && self.replicas <= self.maxReplicas"
              message: "replicas should be in the range minReplicas..maxReplicas."
          properties:
            replicas:
              type: integer
            ...
```
Validation rules support a wide range of use cases. To get a sense of some of the capabilities, let's look at a few examples:
| Validation Rule | Purpose |
|-----------------|---------|
| `self.minReplicas <= self.replicas` | Validate an integer field is less than or equal to another integer field |
| `'Available' in self.stateCounts` | Validate an entry with the 'Available' key exists in a map |
| `self.set1.all(e, !(e in self.set2))` | Validate that the elements of two sets are disjoint |
| `self == oldSelf` | Validate that a required field is immutable once it is set |
| `self.created + self.ttl < self.expired` | Validate that 'expired' date is after a 'create' date plus a 'ttl' duration |
Validation rules are expressive and flexible. See the Validation Rules documentation to learn more about what validation rules are capable of.
Why CEL?
CEL was chosen as the language for validation rules for a couple of reasons:
CEL expressions can easily be inlined into CRD schemas. They are sufficiently expressive to replace the vast majority of CRD validation checks currently implemented in admission webhooks. This results in CRDs that are self-contained and are easier to understand.
CEL expressions are compiled and type checked against a CRD's schema "ahead-of-time" (when CRDs are created and updated) allowing them to be evaluated efficiently and safely "runtime" (when custom resources are validated). Even regex string literals in CEL are validated and pre-compiled when CRDs are created or updated.
Why not use validation webhooks?
Benefits of using validation rules when compared with validation webhooks:
CRD authors benefit from a simpler workflow since validation rules eliminate the need to develop and maintain a webhook.
Cluster administrators benefit by no longer having to install, upgrade and operate webhooks for the purposes of CRD validation.
Cluster operability improves because CRD validation no longer requires a remote call to a webhook endpoint, eliminating a potential point of failure in the request-serving-path of the Kubernetes API server. This allows clusters to retain high availability while scaling to larger amounts of installed CRD extensions, since expected control plane availability would otherwise decrease with each additional webhook installed.
Getting started with validation rules
Writing validation rules in OpenAPIv3 schemas
You can define validation rules for any level of a CRD's OpenAPIv3 schema. Validation rules are automatically scoped to their location in the schema where they are declared.
Good practices for CRD validation rules:
Scope validation rules as close as possible to the field(s) they validate.
Use multiple rules when validating independent constraints.
Do not use validation rules for validations that OpenAPIv3 can already express:
Use OpenAPIv3 value validations (maxLength, maxItems, maxProperties, required, enum, minimum, maximum, ..) and string formats where available.
Use x-kubernetes-int-or-string, x-kubernetes-embedded-type and x-kubernetes-list-type=(set|map) where appropriate.
Examples of good practice:

Validate an integer is between 0 and 100: use OpenAPIv3 value validations.

```yaml
type: integer
minimum: 0
maximum: 100
```

Constrain the maximum size of maps (objects with additionalProperties), arrays and strings: use OpenAPIv3 value validations. This is recommended for all maps, arrays and strings, and is essential for rule cost estimation (explained below).

```yaml
type: array
maxItems: 100
```

Require a date-time to be more recent than a particular timestamp: use OpenAPIv3 string formats to declare that the field is a date-time, and use validation rules to compare it to a particular timestamp.

Require two sets to be disjoint: use x-kubernetes-list-type to validate that the arrays are sets, and use validation rules to validate that the sets are disjoint.

```yaml
type: object
properties:
  set1:
    type: array
    x-kubernetes-list-type: set
  set2:
    ...
x-kubernetes-validations:
  - rule: "self.set1.all(e, !(e in self.set2))"
```
CRD transition rules
Transition Rules make it possible to compare the new state against the old state of a resource in validation rules. You use transition rules to make sure that the cluster's API server does not accept invalid state transitions. A transition rule is a validation rule that references 'oldSelf'. The API server only evaluates transition rules when both an old value and new value exist.
Transition rule examples:
| Transition Rule | Purpose |
|-----------------|---------|
| `self == oldSelf` | For a required field, make that field immutable once it is set. For an optional field, only allow transitioning from unset to set, or from set to unset. |
| (on parent of field) `has(self.field) == has(oldSelf.field)`; (on field) `self == oldSelf` | Make a field immutable: validate that a field, even if optional, never changes after the resource is created (for a required field, the previous rule is simpler). |
| `self.all(x, x in oldSelf)` | Only allow adding items to a field that represents a set (prevent removals). |
| `self >= oldSelf` | Validate that a number is monotonically increasing. |
Using the Functions Libraries
Validation rules have access to a couple of different function libraries. Some examples:

| Example | Purpose |
|---------|---------|
| `isURL(self) && url(self).getHostname() in ['a.example.com', 'b.example.com']` | Validate that a URL has an allowed hostname. |
| `self.map(x, x.weight).sum() == 1` | Validate that the weights of a list of objects sum to 1. |
| `int(self.find('^[0-9]*')) < 100` | Validate that a string starts with a number less than 100. |
| `self.isSorted()` | Validate that a list is sorted. |
Resource use and limits
To prevent CEL evaluation from consuming excessive compute resources, validation rules impose some limits. These limits are based on CEL cost units, a platform and machine independent measure of execution cost. As a result, the limits are the same regardless of where they are enforced.
Estimated cost limit
CEL is, by design, non-Turing-complete in such a way that the halting problem isn’t a concern. CEL takes advantage of this design choice to include an "estimated cost" subsystem that can statically compute the worst case run time cost of any CEL expression. Validation rules are integrated with the estimated cost system and disallow CEL expressions from being included in CRDs if they have a sufficiently poor (high) estimated cost. The estimated cost limit is set quite high and typically requires an O(n^2) or worse operation, across something of unbounded size, to be exceeded. Fortunately the fix is usually quite simple: because the cost system is aware of size limits declared in the CRD's schema, CRD authors can add size limits to the CRD's schema (maxItems for arrays, maxProperties for maps, maxLength for strings) to reduce the estimated cost.
Good practice:
Set maxItems, maxProperties and maxLength on all array, map (object with additionalProperties) and string types in CRD schemas! This results in lower and more accurate estimated costs and generally makes a CRD safer to use.
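For example, a schema that follows this practice bounds every unbounded type (field names are illustrative):

```yaml
type: object
properties:
  tags:
    type: array
    maxItems: 100            # bounds the array for cost estimation
    items:
      type: string
      maxLength: 63          # bounds each string element
  annotations:
    type: object
    maxProperties: 50        # bounds the map
    additionalProperties:
      type: string
      maxLength: 256
```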
Runtime cost limits for CRD validation rules
In addition to the estimated cost limit, CEL keeps track of actual cost while evaluating a CEL expression and will halt execution of the expression if a limit is exceeded.
With the estimated cost limit already in place, the runtime cost limit is rarely encountered. But it is possible. For example, it might be encountered for a large resource composed entirely of a single large list and a validation rule that is either evaluated on each element in the list, or traverses the entire list.
CRD authors can ensure the runtime cost limit will not be exceeded in much the same way the estimated cost limit is avoided: by setting maxItems, maxProperties and maxLength on array, map and string types.
Future work
We look forward to working with the community on the adoption of CRD Validation Rules, and hope to see this feature promoted to general availability in an upcoming Kubernetes release!
There is a growing community of Kubernetes contributors thinking about how to make it possible to write extensible admission controllers using CEL as a substitute for admission webhooks for policy enforcement use cases. Anyone interested should reach out to us on the usual SIG API Machinery channels or via Slack at #sig-api-machinery-cel-dev.
Acknowledgements
Special thanks to Cici Huang, Ben Luddy, Jordan Liggitt, David Eads, Daniel Smith, Dr. Stefan Schimanski, Leila Jalali and everyone who contributed to Validation Rules!
Author: Humble Chirammal (Red Hat), Louis Koo (deeproute.ai)
Kubernetes v1.25, released earlier this month, introduced a new feature
that lets your cluster expand storage volumes, even when access to those
volumes requires a secret (for example: a credential for accessing a SAN fabric)
to perform the node expand operation. This new behavior is in alpha and you
must enable a feature gate (CSINodeExpandSecret) to make use of it.
You must also be using CSI
storage; this change isn't relevant to storage drivers that are built in to Kubernetes.
To turn on this new, alpha feature, you enable the CSINodeExpandSecret feature
gate for the kube-apiserver and kubelet. Once it is enabled, CSI drivers receive
a secretRef configuration as part of the NodeExpansion request and can use it to
perform the node-side expansion operation against the underlying storage system.
What is this all about?
Before Kubernetes v1.24, you were able to define a cluster-level StorageClass
that made use of StorageClass Secrets,
but you didn't have any mechanism to specify the credentials that would be used for
operations that take place when the storage is mounted onto a node and when
the volume has to be expanded on the node side.
Kubernetes CSI already implemented a similar mechanism for specific kinds of
volume resizes: namely, resizes of PersistentVolumes that take place
independently of any node, referred to as Controller Expansion. In that case, you
associate a PersistentVolume with a Secret that contains credentials for volume resize
actions, so that controller expansion can take place. CSI also supports a nodeExpandVolume
operation, which CSI drivers can use independently of, or along with, Controller
Expansion; here the resize is driven from a node in your cluster where
the volume is attached. Please read Kubernetes 1.24: Volume Expansion Now A Stable Feature.
At times, the CSI driver needs to check the actual size of the backend block storage (or image)
before proceeding with a node-level filesystem expand operation. This avoids false positive returns
from the backend storage cluster during filesystem expands.
When a PersistentVolume represents encrypted block storage (for example using LUKS)
you need to provide a passphrase in order to expand the device, and also to make it possible
to grow the filesystem on that device.
For various validations at the time of node expansion, the CSI driver has to be connected
to the backend storage cluster. If the nodeExpandVolume request includes a secretRef,
the CSI driver can use those credentials to connect to the storage cluster and
perform the required cluster operations.
How does it work?
To enable this functionality in this version of Kubernetes, SIG Storage has introduced
a new feature gate called CSINodeExpandSecret. Once the feature gate is enabled
in the cluster, NodeExpandVolume requests can include a secretRef field. The NodeExpandVolume
request is part of CSI; it is sent from the Kubernetes control plane to the CSI driver.
As a cluster operator (or admin), you can specify these secrets as opaque parameters in a StorageClass,
the same way that you can already specify other CSI secret data. The StorageClass needs to have some
CSI-specific parameters set. Here's an example of those parameters:
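A minimal sketch of those parameters (the Secret name and namespace here are illustrative and must match a Secret you create):

```yaml
parameters:
  csi.storage.k8s.io/node-expand-secret-name: test-secret
  csi.storage.k8s.io/node-expand-secret-namespace: default
```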
If the feature gate is enabled and the StorageClass carries the above secret configuration,
the CSI provisioner receives the credentials from the Secret as part of the NodeExpansion request.
CSI volumes that require secrets for online expansion will have the nodeExpandSecretRef
field set. If it is not set, the NodeExpandVolume CSI RPC call is made without a secret.
Trying it out
Enable the CSINodeExpandSecret feature gate (please refer to
Feature Gates).
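As a sketch, the alpha gate can be enabled with the standard --feature-gates flag on both components (all other flags are omitted here):

```shell
kube-apiserver --feature-gates=CSINodeExpandSecret=true
kubelet --feature-gates=CSINodeExpandSecret=true
```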
Create a Secret, and then a StorageClass that uses that Secret.
Here's an example manifest for a Secret that holds credentials:
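A minimal sketch of such a Secret; the credential keys and values are placeholders (what your CSI driver actually expects may differ), while the name and namespace match the StorageClass parameters shown below:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: test-secret        # referenced by csi.storage.k8s.io/node-expand-secret-name
  namespace: default       # referenced by csi.storage.k8s.io/node-expand-secret-namespace
stringData:
  username: admin          # placeholder credential
  password: t0p-Secret     # placeholder credential
```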
Here's an example manifest for a StorageClass that refers to those credentials:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-blockstorage-sc
parameters:
  csi.storage.k8s.io/node-expand-secret-name: test-secret # the name of the Secret
  csi.storage.k8s.io/node-expand-secret-namespace: default # the namespace that the Secret is in
provisioner: blockstorage.cloudprovider.example
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
```
Example output
If the PersistentVolumeClaim (PVC) was created successfully, you can see that
configuration within the spec.csi field of the PersistentVolume (look for
spec.csi.nodeExpandSecretRef).
Check that it worked by running kubectl get persistentvolume <pv_name> -o yaml.
You should see something like the following.
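Here is a trimmed sketch of the relevant part of that output, assuming the example StorageClass above provisioned the volume (all other PersistentVolume fields are omitted):

```yaml
spec:
  csi:
    driver: blockstorage.cloudprovider.example
    nodeExpandSecretRef:
      name: test-secret
      namespace: default
```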
If you then trigger online storage expansion, the kubelet passes the appropriate credentials
to the CSI driver, by loading that Secret and passing the data to the storage driver.
As this feature is still in alpha, the Kubernetes Storage SIG expects to gather feedback from CSI driver
authors and to add more tests and implementation work. The community plans to eventually
promote the feature to beta in upcoming releases.
Get involved or learn more?
The enhancement proposal includes lots of detail about the history and technical
implementation of this feature.
Please get involved by joining the Kubernetes
Storage SIG
(Special Interest Group) to help us enhance this feature.
There are a lot of good ideas already and we'd be thrilled to have more!
Local ephemeral storage capacity isolation was introduced as an alpha feature in Kubernetes 1.7 and went beta in 1.9. With Kubernetes 1.25 we are excited to announce general availability (GA) of this feature.
Pods use ephemeral local storage for scratch space, caching, and logs. The lifetime of local ephemeral storage does not extend beyond the life of the individual pod. It is exposed to pods using the container’s writable layer, logs directory, and EmptyDir volumes. Before this feature was introduced, there were issues related to the lack of local storage accounting and isolation, such as Pods not knowing how much local storage is available and being unable to request guaranteed local storage. Local storage is a best-effort resource and pods can be evicted due to other pods filling the local storage.
The local storage capacity isolation feature allows users to manage local ephemeral storage in the same way as managing CPU and memory. It provides support for capacity isolation of shared storage between pods, such that a pod can be hard limited in its consumption of shared resources by evicting Pods if its consumption of shared storage exceeds that limit. It also allows setting ephemeral storage requests for resource reservation. The limits and requests for shared ephemeral-storage are similar to those for memory and CPU consumption.
How to use local storage capacity isolation
A typical configuration for local ephemeral storage is to place all different kinds of ephemeral local data (emptyDir volumes, writeable layers, container images, logs) into one filesystem. Typically, both /var/lib/kubelet and /var/log are on the system's root filesystem. If users configure the local storage in different ways, kubelet might not be able to correctly measure disk usage and use this feature.
Setting requests and limits for local ephemeral storage
You can specify ephemeral-storage for managing local ephemeral storage. Each container of a Pod can specify either or both of an ephemeral-storage request and an ephemeral-storage limit.
In the following example, the Pod has two containers. The first container has a request of 8GiB of local ephemeral storage and a limit of 12GiB. The second container requests 2GiB of local storage, but no limit setting. Therefore, the Pod requests a total of 10GiB (8GiB+2GiB) of local ephemeral storage and enforces a limit of 12GiB of local ephemeral storage. It also sets emptyDir sizeLimit to 5GiB. With this setting in pod spec, it will affect how the scheduler makes a decision on scheduling pods and also how kubelet evict pods.
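The example described above could be written as the following Pod manifest (names and images are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend                 # illustrative name
spec:
  containers:
  - name: app                    # first container: request 8Gi, limit 12Gi
    image: images.my-company.example/app:v4           # placeholder image
    resources:
      requests:
        ephemeral-storage: "8Gi"
      limits:
        ephemeral-storage: "12Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: "/tmp"
  - name: log-aggregator         # second container: request 2Gi, no limit
    image: images.my-company.example/log-aggregator:v6  # placeholder image
    resources:
      requests:
        ephemeral-storage: "2Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: "/tmp"
  volumes:
  - name: ephemeral
    emptyDir:
      sizeLimit: 5Gi             # pod is evicted if emptyDir usage exceeds this
```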
First of all, the scheduler ensures that the sum of the resource requests of the scheduled containers is less than the capacity of the node. In this case, the pod can be assigned to a node only if its available ephemeral storage (allocatable resource) has more than 10GiB.
Secondly, at the container level, since the first container sets a resource limit, the kubelet eviction manager will measure the disk usage of that container and evict the Pod if the storage usage of the first container exceeds its limit (12GiB). At the pod level, the kubelet works out an overall Pod storage limit by
adding up the limits of all the containers in that Pod. In this case, the total storage usage at pod level is the sum of the disk usage from all containers plus the Pod's emptyDir volumes. If this total usage exceeds the overall Pod storage limit (12GiB), then the kubelet also marks the Pod for eviction.
Last, in this example, the emptyDir volume sets its sizeLimit to 5Gi. This means that if this Pod's emptyDir uses up more local storage than 5GiB, the Pod will be evicted from the node.
Setting resource quota and limitRange for local ephemeral storage
This feature adds two more resource quotas for storage. The request and limit quotas set constraints on the total ephemeral-storage requests and limits of all containers in a namespace.
Similar to CPU and memory, an administrator can use a LimitRange to set a default container-level local storage request/limit, and/or minimum/maximum resource constraints for a namespace.
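A sketch of such a LimitRange (the name and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limits   # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:
      ephemeral-storage: 500Mi     # applied when a container sets no request
    default:
      ephemeral-storage: 1Gi       # applied when a container sets no limit
    min:
      ephemeral-storage: 100Mi
    max:
      ephemeral-storage: 10Gi
```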
Also, ephemeral-storage may be reserved for the kubelet or the system. For example: --system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=10Gi][,][pid=1000] --kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=5Gi][,][pid=1000]. If your cluster node's root disk capacity is 100Gi, then after setting these system-reserved and kube-reserved values, the allocatable ephemeral storage becomes 85Gi. The scheduler will use this information to assign pods based on requests and the allocatable resources of each node. The eviction manager will also use allocatable resources to determine pod eviction. See more details in Reserve Compute Resources for System Daemons.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together.
We offer a huge thank you to all the contributors in Kubernetes Storage SIG and CSI community who helped review the design and implementation of the project, including but not limited to the following:
Authors: Ravi Gudimetla (Apple), Filip Křepinský (Red Hat), Maciej Szulik (Red Hat)
This blog describes two features, minReadySeconds for StatefulSets and maxSurge for DaemonSets, that SIG Apps is happy to graduate to stable in Kubernetes 1.25.
Specifying minReadySeconds slows down a rollout of a StatefulSet, when using a RollingUpdate value in .spec.updateStrategy field, by waiting for each pod for a desired time.
This time can be used for initializing the pod (e.g. warming up the cache) or as a delay before acknowledging the pod.
maxSurge allows a DaemonSet workload to run multiple instances of the same pod on a node during a rollout when using a RollingUpdate value in .spec.updateStrategy field.
This helps to minimize the downtime of the DaemonSet for consumers.
These features were already available in a Deployment and other workloads. This graduation helps to align this functionality across the workloads.
What problems do these features solve?
minReadySeconds for StatefulSets
minReadySeconds ensures that a StatefulSet pod is Ready for the given number of seconds before reporting the
pod as Available. The notion of being Ready and Available is quite important for workloads. For example, some workloads, like Prometheus with multiple instances of Alertmanager, should be considered Available only when the Alertmanager's state transfer is complete. minReadySeconds also helps when using load balancers with cloud providers. Since the pod must be Ready for the given number of seconds, it provides buffer time to prevent pods in rotation from being killed before new pods show up.
maxSurge for DaemonSets
Kubernetes system-level components like CNI and CSI are typically run as DaemonSets. These components can impact the availability of workloads if those DaemonSets go down momentarily during upgrades. This feature allows the number of DaemonSet pods to temporarily increase, thereby ensuring zero downtime for the DaemonSets.
Please note that the usage of hostPort in conjunction with maxSurge in DaemonSets is not allowed as DaemonSet pods are tied to a single node and two active pods cannot share the same port on the same node.
How does it work?
minReadySeconds for StatefulSets
The StatefulSet controller watches the StatefulSet's pods and counts how long a particular pod has been in the Running state; if this value is greater than or equal to the time specified in the .spec.minReadySeconds field of the StatefulSet, the StatefulSet controller updates the AvailableReplicas field in the StatefulSet's status.
maxSurge for DaemonSets
The DaemonSet controller creates additional pods (above the desired number resulting from the DaemonSet spec) based on the value given in .spec.updateStrategy.rollingUpdate.maxSurge. The additional pods run on the same node as the old DaemonSet pod until the old pod gets killed.
The default value is 0.
The value cannot be 0 when MaxUnavailable is 0.
The value can be specified either as an absolute number of pods, or a percentage (rounded up) of desired pods.
How do I use it?
minReadySeconds for StatefulSets
Specify a value for minReadySeconds for any StatefulSet and check if pods are available or not by inspecting
AvailableReplicas field using:
kubectl get statefulset/<name_of_the_statefulset> -o yaml
Please note that the default value of minReadySeconds is 0.
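As a sketch, a StatefulSet with minReadySeconds set might look like this (the name, image, and value of 10 seconds are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                  # illustrative name
spec:
  serviceName: web
  replicas: 3
  minReadySeconds: 10        # each pod must stay Ready 10s before counting as Available
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx          # illustrative container
        image: registry.k8s.io/nginx-slim:0.8   # illustrative image
```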
maxSurge for DaemonSets
Specify a value for .spec.updateStrategy.rollingUpdate.maxSurge and set .spec.updateStrategy.rollingUpdate.maxUnavailable to 0.
Then observe a faster rollout and higher number of pods running at the same time in the next rollout.
kubectl rollout restart daemonset <name_of_the_daemonset>
kubectl get pods -w
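Putting it together, a DaemonSet using this rollout strategy might look like the following sketch (the name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: logging-agent        # illustrative name
spec:
  selector:
    matchLabels:
      app: logging-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # allow one extra pod per node during the rollout
      maxUnavailable: 0      # never go below the desired count
  template:
    metadata:
      labels:
        app: logging-agent
    spec:
      containers:
      - name: agent
        image: agent.example/logging-agent:1.0   # placeholder image
```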
Kubernetes 1.25 introduces Alpha support for a new kubelet-managed pod condition
in the status field of a pod: PodHasNetwork. The kubelet, for a worker node,
will use the PodHasNetwork condition to accurately surface the initialization
state of a pod from the perspective of pod sandbox creation and network
configuration by a container runtime (typically in coordination with CNI
plugins). The kubelet starts pulling container images and starting individual
containers (including init containers) only after the status of the PodHasNetwork
condition is set to True. Metrics collection services that report latency of
pod initialization from a cluster infrastructural perspective (i.e. agnostic of
per container characteristics like image size or payload) can utilize the
PodHasNetwork condition to accurately generate Service Level Indicators
(SLIs). Certain operators or controllers that manage underlying pods may utilize
the PodHasNetwork condition to optimize the set of actions performed when pods
repeatedly fail to come up.
How is this different from the existing Initialized condition reported for pods?
The kubelet sets the status of the existing Initialized condition reported in
the status field of a pod depending on the presence of init containers in a pod.
If a pod specifies init containers, the status of the Initialized condition in
the pod status will not be set to True until all init containers for the pod
have succeeded. However, init containers, configured by users, may have errors
(payload crashing, invalid image, etc) and the number of init containers
configured in a pod may vary across different workloads. Therefore,
cluster-wide, infrastructural SLIs around pod initialization cannot depend on
the Initialized condition of pods.
If a pod does not specify init containers, the status of the Initialized
condition in the pod status is set to True very early in the lifecycle of the
pod. This occurs before the kubelet initiates any pod runtime sandbox creation
and network configuration steps. As a result, a pod without init containers will
report the status of the Initialized condition as True even if the container
runtime is not able to successfully initialize the pod sandbox environment.
Relative to either situation above, the PodHasNetwork condition surfaces more
accurate data around when the pod runtime sandbox was initialized with
networking configured so that the kubelet can proceed to launch user-configured
containers (including init containers) in the pod.
Note that a node agent may dynamically re-configure network interface(s) for a
pod by watching changes in pod annotations that specify additional networking
configuration (e.g. k8s.v1.cni.cncf.io/networks). Dynamic updates of pod
networking configuration after the pod sandbox is initialized by Kubelet (in
coordination with a container runtime) are not reflected by the PodHasNetwork
condition.
Try out the PodHasNetwork condition for pods
In order to have the kubelet report the PodHasNetwork condition in the status
field of a pod, please enable the PodHasNetworkCondition feature gate on the
kubelet.
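One way to do that is through the kubelet configuration file; a minimal sketch showing only the relevant field:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  PodHasNetworkCondition: true
```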
For a pod whose runtime sandbox has been successfully created and has networking
configured, the kubelet will report the PodHasNetwork condition with status set to True:
```console
$ kubectl describe pod nginx1
Name:         nginx1
Namespace:    default
...
Conditions:
  Type              Status
  PodHasNetwork     True
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
```
For a pod whose runtime sandbox has not been created yet (and networking not
configured either), the kubelet will report the PodHasNetwork condition with
status set to False:
```console
$ kubectl describe pod nginx2
Name:         nginx2
Namespace:    default
...
Conditions:
  Type              Status
  PodHasNetwork     False
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
```
What’s next?
Depending on feedback and adoption, the Kubernetes team plans to push the
reporting of the PodHasNetwork condition to Beta in 1.26 or 1.27.
How can I learn more?
Please check out the
documentation for the
PodHasNetwork condition to learn more about it and how it fits in relation to
other pod conditions.
How to get involved?
This feature is driven by the SIG Node community. Please join us to connect with
the community and share your ideas and feedback around the above feature and
beyond. We look forward to hearing from you!
Acknowledgements
We want to thank the following people for their insightful and helpful reviews
of the KEP and PRs around this feature: Derek Carr (@derekwaynecarr), Mrunal
Patel (@mrunalp), Dawn Chen (@dchen1107), Qiutong Song (@qiutongs), Ruiwen Zhao
(@ruiwen-zhao), Tim Bannister (@sftim), Danielle Lancashire (@endocrimes) and
Agam Dua (@agamdua).
A long-standing request from the Kubernetes community has been to have a
programmatic way for end users to keep track of Kubernetes security issues
(also called "CVEs", after the database that tracks public security issues across
different products and vendors). Accompanying the release of Kubernetes v1.25,
we are excited to announce availability of such
a feed as an alpha
feature. This blog will cover the background and scope of this new service.
Motivation
With the growing number of eyes on Kubernetes, the number of CVEs related to
Kubernetes has increased. Although most CVEs that directly, indirectly, or
transitively impact Kubernetes are regularly fixed, there is no single place for
the end users of Kubernetes to programmatically subscribe or pull the data of
fixed CVEs. Current options are either broken or incomplete.
Scope
What This Does
Create a periodically auto-refreshing, human and machine-readable list of
official Kubernetes CVEs
What This Doesn't Do
Triage and vulnerability disclosure will continue to be done by SRC (Security
Response Committee).
Listing CVEs that are identified in build-time dependencies and container
images is out of scope.
Only official CVEs announced by the Kubernetes SRC will be published in the
feed.
Who It's For
End Users: Persons or teams who use Kubernetes to deploy applications
they own
Platform Providers: Persons or teams who manage Kubernetes clusters
Maintainers: Persons or teams who create and support Kubernetes
releases through their work in Kubernetes Community - via various Special
Interest Groups and Committees.
What's Next?
In order to graduate this feature, SIG Security
is gathering feedback from end users who are using this alpha feed.
So in order to improve the feed in future Kubernetes Releases, if you have any
feedback, please let us know by adding a comment to
this tracking issue or
let us know on
#sig-security-tooling
Kubernetes Slack channel.
(Join Kubernetes Slack here)
A special shout out and massive thanks to Neha Lohia
(@nehalohia27) and Tim
Bannister (@sftim) for their stellar collaboration
for many months from "ideation to implementation" of this feature.
Authors: Anish Ramasekar, Rita Zhang, Mo Khan, and Xander Grzywinski (Microsoft)
With Kubernetes v1.25, SIG Auth is introducing a new v2alpha1 version of the Key Management Service (KMS) API. There are a lot of improvements in the works, and we're excited to be able to start down the path of a new and improved KMS!
What is KMS?
One of the first things to consider when securing a Kubernetes cluster is encrypting persisted API data at rest. KMS provides an interface for a provider to utilize a key stored in an external key service to perform this encryption.
Encryption at rest using KMS v1 has been a feature of Kubernetes since version v1.10, and has been in beta since version v1.12.
What’s new in v2alpha1?
While the original v1 implementation has been successful in helping Kubernetes users encrypt etcd data, it did fall short in a few key ways:
Performance: When starting a cluster, all resources are serially fetched and decrypted to fill the kube-apiserver cache. When using a KMS plugin, this can cause slow startup times due to the large number of requests made to the remote vault. In addition, there is the potential to hit API rate limits on external key services depending on how many encrypted resources exist in the cluster.
Key Rotation: With KMS v1, rotation of a key-encrypting key is a manual and error-prone process. It can be difficult to determine what encryption keys are in-use on a cluster.
Health Check & Status: Before the KMS v2 API, the kube-apiserver was forced to make encrypt and decrypt calls as a proxy to determine if the KMS plugin is healthy. With cloud services, these operations usually cost actual money. Whatever the cost, those operations on their own do not provide a holistic view of the service's health.
Observability: Without some kind of trace ID, it has been difficult to correlate events found in the various logs across the kube-apiserver, KMS, and KMS plugins.
The KMS v2 enhancement attempts to address all of these shortcomings, though not all planned features are implemented in the initial alpha release. Here are the improvements that arrived in Kubernetes v1.25:
Extra metadata is now tracked to allow a KMS plugin to communicate what key it is currently using with the kube-apiserver, allowing for rotation without API server restart. Data stored in etcd follows a more standard proto format to allow external tools to observe its state. To learn more, check out the details for metadata.
A dedicated status API is used to communicate the health of the KMS plugin with the API server. To learn more, check out the details for status API.
To improve observability, a new UID field is included in EncryptRequest and DecryptRequest of the v2 API. The UID is generated for each envelope operation. To learn more, check out the details for observability.
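To make this concrete, here is a sketch of how a cluster operator might point the kube-apiserver at a KMS v2 plugin in its EncryptionConfiguration. The plugin name and socket path are hypothetical placeholders; the `apiVersion: v2` field is what selects the new (alpha) API.

```shell
# Sketch: EncryptionConfiguration using the KMS v2 (alpha) provider.
# "my-kms-plugin" and the socket path are hypothetical placeholders.
cat > /tmp/encryption-config.yaml <<'EOF'
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2          # selects the v2alpha1 KMS API
          name: my-kms-plugin
          endpoint: unix:///var/run/kms/my-kms-plugin.sock
      - identity: {}              # fallback for reading still-unencrypted data
EOF
# Pass to the API server with:
#   kube-apiserver --encryption-provider-config=/tmp/encryption-config.yaml ...
```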
Sequence Diagram
Encrypt Request
Decrypt Request
What’s next?
For Kubernetes v1.26, we expect to ship another alpha version. As of right now, the alpha API will be ready to be used by KMS plugin authors. We hope to include a reference plugin implementation with the next release, and you'll be able to try out the feature at that time.
You can learn more about KMS v2 by reading Using a KMS provider for data encryption. You can also follow along on the KEP to track progress across the coming Kubernetes releases.
How to get involved
If you are interested in getting involved in the development of this feature or would like to share feedback, please reach out on the #sig-auth-kms-dev channel on Kubernetes Slack.
You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.
Acknowledgements
This feature has been an effort driven by contributors from several different companies. We would like to extend a huge thank you to everyone that contributed their time and effort to help make this possible.
Some Kubernetes components (such as kubelet and kube-proxy) create
iptables chains and rules as part of their operation. These chains
were never intended to be part of any Kubernetes API/ABI guarantees,
but some external components nonetheless make use of some of them (in
particular, using KUBE-MARK-MASQ to mark packets as needing to be
masqueraded).
As a part of the v1.25 release, SIG Network made this declaration
explicit: that (with one exception), the iptables chains that
Kubernetes creates are intended only for Kubernetes’s own internal
use, and third-party components should not assume that Kubernetes will
create any specific iptables chains, or that those chains will contain
any specific rules if they do exist.
Then, in future releases, as part of KEP-3178, we will begin phasing
out certain chains that Kubernetes itself no longer needs. Components
outside of Kubernetes itself that make use of KUBE-MARK-MASQ,
KUBE-MARK-DROP, or other Kubernetes-generated iptables chains should
start migrating away from them now.
Background
In addition to various service-specific iptables chains, kube-proxy
creates certain general-purpose iptables chains that it uses as part
of service proxying. In the past, kubelet also used iptables for a few
features (such as setting up hostPort mapping for pods) and so it
also redundantly created some of the same chains.
However, with the removal of dockershim in Kubernetes 1.24,
kubelet now no longer ever uses any iptables rules for its own
purposes; the things that it used to use iptables for are now always
the responsibility of the container runtime or the network plugin, and
there is no reason for kubelet to be creating any iptables rules.
Meanwhile, although iptables is still the default kube-proxy backend
on Linux, it is unlikely to remain the default forever, since the
associated command-line tools and kernel APIs are essentially
deprecated, and no longer receiving improvements. (RHEL 9
logs a warning if you use the iptables API, even via
iptables-nft.)
Although as of Kubernetes 1.25 iptables kube-proxy remains popular,
and kubelet continues to create the iptables rules that it
historically created (despite no longer using them), third party
software cannot assume that core Kubernetes components will keep
creating these rules in the future.
Upcoming changes
Starting a few releases from now, kubelet will no longer create the
following iptables chains in the nat table:
KUBE-MARK-DROP
KUBE-MARK-MASQ
KUBE-POSTROUTING
Additionally, the KUBE-FIREWALL chain in the filter table will no
longer have the functionality currently associated with
KUBE-MARK-DROP (and it may eventually go away entirely).
This change will be phased in via the IPTablesOwnershipCleanup
feature gate. That feature gate is available and can be manually
enabled for testing in Kubernetes 1.25. The current plan is that it
will become enabled-by-default in Kubernetes 1.27, though this may be
delayed to a later release. (It will not happen sooner than Kubernetes
1.27.)
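For example, one way to try the gate early is through the kubelet configuration file (the /tmp path here is for illustration; a real node would use its actual kubelet config file, whose location varies by distribution and installer, and kube-proxy has an analogous featureGates field):

```shell
# Sketch: enable the IPTablesOwnershipCleanup feature gate for testing.
cat > /tmp/kubelet-iptables-gate.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  IPTablesOwnershipCleanup: true
EOF
```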
What to do if you use Kubernetes’s iptables chains
(Although the discussion below focuses on short-term fixes that are
still based on iptables, you should probably also start thinking about
eventually migrating to nftables or another API).
If you use KUBE-MARK-MASQ...
If you are making use of the KUBE-MARK-MASQ chain to cause packets
to be masqueraded, you have two options: (1) rewrite your rules to use
-j MASQUERADE directly, (2) create your own alternative “mark for
masquerade” chain.
The reason kube-proxy uses KUBE-MARK-MASQ is because there are lots
of cases where it needs to call both -j DNAT and -j MASQUERADE on
a packet, but it’s not possible to do both of those at the same time
in iptables; DNAT must be called from the PREROUTING (or OUTPUT)
chain (because it potentially changes where the packet will be routed
to) while MASQUERADE must be called from POSTROUTING (because the
masqueraded source IP that it picks depends on what the final routing
decision was).
In theory, kube-proxy could have one set of rules to match packets in
PREROUTING/OUTPUT and call -j DNAT, and then have a second set
of rules to match the same packets in POSTROUTING and call -j MASQUERADE. But instead, for efficiency, it only matches them once,
during PREROUTING/OUTPUT, at which point it calls -j DNAT and
then calls -j KUBE-MARK-MASQ to set a bit on the kernel packet mark
as a reminder to itself. Then later, during POSTROUTING, it has a
single rule that matches all previously-marked packets, and calls -j MASQUERADE on them.
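The pattern described above can be sketched in iptables-restore format. The chain name, mark bit, and addresses below are illustrative placeholders, not kube-proxy's actual values:

```shell
# Sketch of the "mark now, masquerade later" pattern (illustrative values only).
cat > /tmp/mark-masq-pattern.txt <<'EOF'
*nat
:MY-MARK-MASQ - [0:0]
# The mark chain just sets a bit on the kernel packet mark.
-A MY-MARK-MASQ -j MARK --set-xmark 0x8000/0x8000
# PREROUTING: mark the packet, then DNAT it (DNAT ends nat processing).
-A PREROUTING -d 10.0.0.10/32 -p tcp --dport 80 -j MY-MARK-MASQ
-A PREROUTING -d 10.0.0.10/32 -p tcp --dport 80 -j DNAT --to-destination 192.168.1.5:8080
# POSTROUTING: one rule catches everything marked earlier.
-A POSTROUTING -m mark --mark 0x8000/0x8000 -j MASQUERADE
COMMIT
EOF
# Load (as root) with: iptables-restore --noflush < /tmp/mark-masq-pattern.txt
```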
If you have a lot of rules where you need to apply both DNAT and
masquerading to the same packets like kube-proxy does, then you may
want a similar arrangement. But in many cases, components that use
KUBE-MARK-MASQ are only doing it because they copied kube-proxy’s
behavior without understanding why kube-proxy was doing it that way.
Many of these components could easily be rewritten to just use
separate DNAT and masquerade rules. (In cases where no DNAT is
occurring then there is even less point to using KUBE-MARK-MASQ;
just move your rules from PREROUTING to POSTROUTING and call -j MASQUERADE directly.)
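For the simple no-DNAT case, the rewrite might look like this sketch (the subnet is a placeholder): instead of jumping to KUBE-MARK-MASQ from PREROUTING, match the same traffic in POSTROUTING and masquerade directly.

```shell
# Sketch: replace a KUBE-MARK-MASQ dependency with a direct MASQUERADE rule.
cat > /tmp/direct-masq.txt <<'EOF'
*nat
# Before (relied on a Kubernetes-owned chain):
#   -A PREROUTING -s 172.30.0.0/16 -j KUBE-MARK-MASQ
# After (self-contained, in POSTROUTING where MASQUERADE is valid):
-A POSTROUTING -s 172.30.0.0/16 ! -d 172.30.0.0/16 -j MASQUERADE
COMMIT
EOF
# Load (as root) with: iptables-restore --noflush < /tmp/direct-masq.txt
```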
If you use KUBE-MARK-DROP...
The rationale for KUBE-MARK-DROP is similar to the rationale for
KUBE-MARK-MASQ: kube-proxy wanted to make packet-dropping decisions
alongside other decisions in the KUBE-SERVICES chain in the nat table, but you
can only call -j DROP from the filter table. So instead, it uses
KUBE-MARK-DROP to mark packets to be dropped later on.
In general, the approach for removing a dependency on KUBE-MARK-DROP
is the same as for removing a dependency on KUBE-MARK-MASQ. In
kube-proxy’s case, it is actually quite easy to replace the usage of
KUBE-MARK-DROP in the nat table with direct calls to DROP in the
filter table, because there are no complicated interactions between
DNAT rules and drop rules, and so the drop rules can simply be moved
from nat to filter.
In more complicated cases, it might be necessary to “re-match” the
same packets in both nat and filter.
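As a sketch of the simple case, a nat-table rule that jumped to KUBE-MARK-DROP can become a direct DROP in the filter table (the chain and port match here are illustrative placeholders):

```shell
# Sketch: replace a KUBE-MARK-DROP dependency with a direct filter-table DROP.
cat > /tmp/direct-drop.txt <<'EOF'
*filter
# Before (nat table, relied on a Kubernetes-owned chain):
#   -A MY-CHAIN -p tcp --dport 30080 -j KUBE-MARK-DROP
# After (filter table, where -j DROP is allowed):
-A INPUT -p tcp --dport 30080 -j DROP
COMMIT
EOF
# Load (as root) with: iptables-restore --noflush < /tmp/direct-drop.txt
```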
If you use Kubelet’s iptables rules to figure out iptables-legacy vs iptables-nft...
Components that manipulate host-network-namespace iptables rules from
inside a container need some way to figure out whether the host is
using the old iptables-legacy binaries or the newer iptables-nft
binaries (which talk to a different kernel API underneath).
The iptables-wrappers module provides a way for such components to
autodetect the system iptables mode, but in the past it did this by
assuming that Kubelet will have created “a bunch” of iptables rules
before any containers start, and so it can guess which mode the
iptables binaries in the host filesystem are using by seeing which
mode has more rules defined.
In future releases, Kubelet will no longer create many iptables rules,
so heuristics based on counting the number of rules present may fail.
However, as of 1.24, Kubelet always creates a chain named
KUBE-IPTABLES-HINT in the mangle table of whichever iptables
subsystem it is using. Components can now look for this specific chain
to know which iptables subsystem Kubelet (and thus, presumably, the
rest of the system) is using.
(Additionally, since Kubernetes 1.17, kubelet has created a chain
called KUBE-KUBELET-CANARY in the mangle table. While this chain
may go away in the future, it will of course still be there in older
releases, so in any recent version of Kubernetes, at least one of
KUBE-IPTABLES-HINT or KUBE-KUBELET-CANARY will be present.)
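A minimal detection function along these lines might look like the following sketch. It assumes the iptables-legacy and iptables-nft binaries are on the PATH, and prints "unknown" when neither hint chain can be found (for example, when running without sufficient privileges):

```shell
# Sketch: guess the host iptables mode from kubelet's hint chains.
detect_iptables_mode() {
    for mode in nft legacy; do
        # KUBE-IPTABLES-HINT exists on kubelet >= 1.24; fall back to the
        # older KUBE-KUBELET-CANARY chain for earlier releases.
        for chain in KUBE-IPTABLES-HINT KUBE-KUBELET-CANARY; do
            if "iptables-${mode}" -t mangle -nL "${chain}" >/dev/null 2>&1; then
                echo "${mode}"
                return 0
            fi
        done
    done
    echo "unknown"
}
detect_iptables_mode
```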
The iptables-wrappers package has already been updated with this new
heuristic, so if you were previously using that, you can rebuild your
container images with an updated version of that.
Further reading
The project to clean up iptables chain ownership and deprecate the old
chains is tracked by KEP-3178.
This article introduces the Container Object Storage Interface (COSI), a standard for provisioning and consuming object storage in Kubernetes. It is an alpha feature in Kubernetes v1.25.
File and block storage are treated as first class citizens in the Kubernetes ecosystem via Container Storage Interface (CSI). Workloads using CSI volumes enjoy the benefits of portability across vendors and across Kubernetes clusters without the need to change application manifests. An equivalent standard does not exist for Object storage.
Object storage has been rising in popularity in recent years as an alternative form of storage to filesystems and block devices. The object storage paradigm promotes disaggregation of compute and storage. This is done by making data available over the network, rather than locally. Disaggregated architectures allow compute workloads to be stateless, which consequently makes them easier to manage, scale and automate.
COSI
COSI aims to standardize consumption of object storage to provide the following benefits:
Kubernetes Native - Use the Kubernetes API to provision, configure and manage buckets
Self Service - A clear delineation between administration and operations (DevOps) to enable self-service capability for DevOps personnel
Portability - Vendor neutrality enabled through portability across Kubernetes Clusters and across Object Storage vendors
Portability across vendors is only possible when both vendors support a common datapath API. For example, it is possible to port from AWS S3 to Ceph, or from AWS S3 to MinIO and back, as they all use the S3 API. In contrast, it is not possible to port from AWS S3 to Google Cloud's GCS or vice versa.
Architecture
COSI is made up of three components:
COSI Controller Manager
COSI Sidecar
COSI Driver
The COSI Controller Manager acts as the main controller that processes changes to COSI API objects. It is responsible for fielding requests for bucket creation, updates, deletion and access management. One instance of the controller manager is required per Kubernetes cluster. Only one is needed even if multiple object storage providers are used in the cluster.
The COSI Sidecar acts as a translator between COSI API requests and vendor-specific COSI Drivers. This component uses a standardized gRPC protocol that vendor drivers are expected to satisfy.
The COSI Driver is the vendor specific component that receives requests from the sidecar and calls the appropriate vendor APIs to create buckets, manage their lifecycle and manage access to them.
API
The COSI API is centered around buckets, since the bucket is the unit of abstraction for object storage. COSI defines three Kubernetes APIs aimed at managing them:
Bucket
BucketClass
BucketClaim
In addition, two more APIs for managing access to buckets are also defined:
BucketAccess
BucketAccessClass
In a nutshell, Bucket and BucketClaim can be considered to be similar to PersistentVolume and PersistentVolumeClaim respectively. The BucketClass’ counterpart in the file/block device world is StorageClass.
Since object storage is always authenticated and accessed over the network, access credentials are required to access buckets. The two APIs, BucketAccess and BucketAccessClass, are used to denote access credentials and policies for authentication. More info about these APIs can be found in the official COSI proposal - https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1979-object-storage-support
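As a rough sketch of how these APIs fit together, a DevOps user might claim a bucket like this. The field names follow the alpha proposal and may change; the resource names are placeholders:

```shell
# Sketch: a BucketClaim referencing an admin-created BucketClass
# (objectstorage.k8s.io/v1alpha1 is an alpha API; fields may change).
cat > /tmp/bucketclaim.yaml <<'EOF'
apiVersion: objectstorage.k8s.io/v1alpha1
kind: BucketClaim
metadata:
  name: my-bucket-claim
spec:
  bucketClassName: my-bucket-class   # created by the admin
  protocols:
    - S3                             # datapath protocol the workload expects
EOF
# Apply with: kubectl apply -f /tmp/bucketclaim.yaml
```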
Self-Service
Beyond providing Kubernetes-API-driven bucket management, COSI also aims to empower DevOps personnel to provision and manage buckets on their own, without admin intervention. This further enables dev teams to realize faster turnaround times and faster time-to-market.
COSI achieves this by dividing bucket provisioning steps among two different stakeholders, namely the administrator (admin), and the cluster operator. The administrator will be responsible for setting broad policies and limits on how buckets are provisioned, and how access is obtained for them. The cluster operator will be free to create and utilize buckets within the limits set by the admin.
For example, an admin policy could restrict the maximum provisioned capacity to 100GB, and developers would be allowed to create buckets and store data up to that limit. Similarly for access credentials, admins would be able to restrict who can access which buckets, and developers would be able to access all the buckets available to them.
Portability
The third goal of COSI is to achieve vendor neutrality for bucket management. COSI enables two kinds of portability:
Cross Cluster
Cross Provider
Cross-cluster portability means allowing buckets provisioned in one cluster to be available in another cluster. This is only valid when the object storage backend itself is accessible from both clusters.
Cross-provider portability is about allowing organizations or teams to move from one object storage provider to another seamlessly, and without requiring changes to application definitions (PodTemplates, StatefulSets, Deployment and so on). This is only possible if the source and destination providers use the same datapath API.
COSI does not handle data migration as it is outside of its scope. In case porting between providers requires data to be migrated as well, then other measures need to be taken to ensure data availability.
What’s next
The amazing sig-storage-cosi community has worked hard to bring the COSI standard to alpha status. We are looking forward to onboarding a lot of vendors to write COSI drivers and become COSI compatible!
We want to add more authentication mechanisms for COSI buckets, we are designing advanced bucket sharing primitives, multi-cluster bucket management and much more. Lots of great ideas and opportunities ahead!
Stay tuned for what comes next, and if you have any questions, comments or suggestions, please reach out to the community!
Authors: David Porter (Google), Mrunal Patel (Red Hat)
Kubernetes 1.25 brings cgroup v2 to GA (general availability), letting the
kubelet use the latest container resource
management capabilities.
What are cgroups?
Effective resource management is a
critical aspect of Kubernetes. This involves managing the finite resources in
your nodes, such as CPU, memory, and storage.
cgroups are a Linux kernel capability that establish resource management
functionality like limiting CPU usage or setting memory limits for running
processes.
When you use the resource management capabilities in Kubernetes, such as configuring
requests and limits for Pods and containers,
Kubernetes uses cgroups to enforce your resource requests and limits.
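For instance, a Pod spec like the following sketch is translated by the kubelet and runtime into cgroup settings (the image and values are placeholders; the comments note which cgroup files typically back each setting):

```shell
# Sketch: Pod resource requests/limits and the cgroup files behind them.
cat > /tmp/pod-resources.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.8   # placeholder image
      resources:
        requests:
          cpu: 250m        # cpu.weight (v2) / cpu.shares (v1)
          memory: 64Mi
        limits:
          cpu: 500m        # cpu.max (v2) / cpu.cfs_quota_us (v1)
          memory: 128Mi    # memory.max (v2) / memory.limit_in_bytes (v1)
EOF
```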
The Linux kernel offers two versions of cgroups: cgroup v1 and cgroup v2.
What is cgroup v2?
cgroup v2 is the latest version of the Linux cgroup API. cgroup v2 provides a
unified control system with enhanced resource management capabilities.
cgroup v2 has been in development in the Linux Kernel since 2016 and in recent
years has matured across the container ecosystem. With Kubernetes 1.25, cgroup
v2 support has graduated to general availability.
Many recent releases of Linux distributions have switched over to cgroup v2 by
default so it's important that Kubernetes continues to work well on these new
updated distros.
cgroup v2 offers several improvements over cgroup v1, such as the following:
Enhanced resource allocation management and isolation across multiple resources
Unified accounting for different types of memory allocations (network and kernel memory, etc)
Accounting for non-immediate resource changes such as page cache write backs
Some Kubernetes features exclusively use cgroup v2 for enhanced resource
management and isolation. For example,
the MemoryQoS feature improves
memory utilization and relies on cgroup v2 functionality to enable it. New
resource management features in the kubelet will also take advantage of the new
cgroup v2 features moving forward.
How do you use cgroup v2?
Many Linux distributions are switching to cgroup v2 by default; you might start
using it the next time you update the Linux version of your control plane and
nodes!
Using a Linux distribution that uses cgroup v2 by default is the recommended
method. Some of the popular Linux distributions that use cgroup v2 include the
following:
Container Optimized OS (since M97)
Ubuntu (since 21.10)
Debian GNU/Linux (since Debian 11 Bullseye)
Fedora (since 31)
Arch Linux (since April 2021)
RHEL and RHEL-like distributions (since 9)
To check if your distribution uses cgroup v2 by default,
refer to Check your cgroup version or
consult your distribution's documentation.
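That check boils down to a single command on the node:

```shell
# Prints "cgroup2fs" when the host uses cgroup v2 (the unified hierarchy);
# on cgroup v1 hosts, /sys/fs/cgroup is typically a tmpfs.
stat -fc %T /sys/fs/cgroup/
```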
If you're using a managed Kubernetes offering, consult your provider to
determine how they're adopting cgroup v2, and whether you need to take action.
To use cgroup v2 with Kubernetes, you must meet the following requirements:
Your Linux distribution enables cgroup v2 on kernel version 5.8 or later
Your container runtime supports cgroup v2 (for example, containerd v1.4 and later, or CRI-O v1.20 and later)
The kubelet and the container runtime are configured to use the systemd cgroup driver
The kubelet and container runtime use a cgroup driver
to set cgroup parameters. When using cgroup v2, it's strongly recommended that both
the kubelet and your container runtime use the
systemd cgroup driver,
so that there's a single cgroup manager on the system. To configure the kubelet
and the container runtime to use the driver, refer to the
systemd cgroup driver documentation.
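A sketch of the two configuration snippets involved follows; files are written to /tmp here for illustration only, since the real paths depend on your installation (for example, /var/lib/kubelet/config.yaml and /etc/containerd/config.toml):

```shell
# kubelet side: select the systemd cgroup driver.
cat > /tmp/kubelet-cgroup-driver.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
EOF

# containerd side: matching runc option (snippet of containerd's config.toml).
cat > /tmp/containerd-snippet.toml <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
EOF
```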
Migrate to cgroup v2
When you run Kubernetes with a Linux distribution that enables cgroup v2, the
kubelet should automatically adapt without any additional configuration
required, as long as you meet the requirements.
In most cases, you won't see a difference in the user experience when you
switch to using cgroup v2 unless your users access the cgroup file system
directly.
If you have applications that access the cgroup file system directly, either on
the node or from inside a container, you must update the applications to use
the cgroup v2 API instead of the cgroup v1 API.
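As a small example of the kind of change involved, reading a memory limit differs between the two hierarchies. This sketch picks the right file based on the mounted filesystem type; note that memory.max may be absent at the root of the v2 hierarchy, so it falls back to "unknown":

```shell
# Sketch: read the memory limit of the current cgroup under either version.
memory_limit() {
    if [ "$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)" = "cgroup2fs" ]; then
        # cgroup v2: single unified file (may be absent at the root cgroup)
        cat /sys/fs/cgroup/memory.max 2>/dev/null || echo "unknown"
    else
        # cgroup v1: per-controller hierarchy
        cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo "unknown"
    fi
}
memory_limit
```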
Scenarios in which you might need to update to cgroup v2 include the following:
If you run third-party monitoring and security agents that depend on the cgroup file system, update the
agents to versions that support cgroup v2.
If you run cAdvisor as a stand-alone
DaemonSet for monitoring pods and containers, update it to v0.43.0 or later.
If you deploy Java applications with the JDK, prefer to use JDK 11.0.16 and
later or JDK 15 and later, which fully support cgroup v2.
Learn more about
cgroups on Linux Manual Pages
and cgroup v2 on the Linux Kernel documentation
Get involved
Your feedback is always welcome! SIG Node meets regularly, and its members are
available in the #sig-node channel on Kubernetes Slack, or
via the SIG mailing list.
cgroup v2 has had a long journey and is a great example of open source
community collaboration across the industry because it required work across the
stack, from the Linux Kernel to systemd to various container runtimes, and (of
course) Kubernetes.
Acknowledgments
We would like to thank Giuseppe Scrivano, who
initiated cgroup v2 support in Kubernetes, and the SIG Node community, including
chairs Dawn Chen and Derek Carr, for their
reviews and leadership.
We'd also like to thank the maintainers of container runtimes like Docker,
containerd and CRI-O, and the maintainers of components like
cAdvisor
and runc (with its libcontainer library),
which underpin many container runtimes. Finally, this wouldn't have been
possible without support from systemd and upstream Linux Kernel maintainers.
CSI Inline Volumes were introduced as an alpha feature in Kubernetes 1.15 and have been beta since 1.16. We are happy to announce that this feature has graduated to General Availability (GA) status in Kubernetes 1.25.
CSI Inline Volumes are similar to other ephemeral volume types, such as configMap, downwardAPI and secret. The important difference is that the storage is provided by a CSI driver, which allows the use of ephemeral storage provided by third-party vendors. The volume is defined as part of the pod spec and follows the lifecycle of the pod, meaning the volume is created once the pod is scheduled and destroyed when the pod is destroyed.
What's new in 1.25?
There are a couple of new bug fixes related to this feature in 1.25, and the CSIInlineVolume feature gate has been locked to True with the graduation to GA. There are no new API changes, so users of this feature during beta should not notice any significant changes aside from these bug fixes.
CSI inline volumes are meant for simple local volumes that should follow the lifecycle of the pod. They may be useful for providing secrets, configuration data, or other special-purpose storage to the pod from a CSI driver.
A CSI driver is not suitable for inline use when:
The volume needs to persist longer than the lifecycle of a pod
Volume snapshots, cloning, or volume expansion are required
The CSI driver requires volumeAttributes that should be restricted to an administrator
How to use this feature
In order to use this feature, the CSIDriver spec must explicitly list Ephemeral as one of the supported volumeLifecycleModes. Here is a simple example from the Secrets Store CSI Driver.
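A minimal sketch of such a CSIDriver object follows. The driver name matches the Secrets Store CSI Driver; the other fields shown are typical for inline-only drivers, but check the driver's own manifests for the authoritative version:

```shell
# Sketch: CSIDriver advertising support for inline (Ephemeral) volumes.
cat > /tmp/csidriver.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: secrets-store.csi.k8s.io
spec:
  podInfoOnMount: true
  attachRequired: false
  volumeLifecycleModes:
    - Ephemeral        # opts the driver in to inline use
EOF
# Apply with: kubectl apply -f /tmp/csidriver.yaml
```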
If the driver supports any volume attributes, you can provide these as part of the spec for the Pod as well:
csi:
  driver: block.csi.vendor.example
  volumeAttributes:
    foo: bar
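In context, a complete Pod manifest using that fragment could look like the following sketch (the driver name and attributes are the hypothetical ones above, and the image is a placeholder):

```shell
# Sketch: Pod with a CSI inline (ephemeral) volume; driver name is hypothetical.
cat > /tmp/pod-inline-csi.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: inline-csi-demo
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.8
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      csi:
        driver: block.csi.vendor.example
        volumeAttributes:
          foo: bar
EOF
# Apply with: kubectl apply -f /tmp/pod-inline-csi.yaml
```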
Example Use Cases
Two existing CSI drivers that support the Ephemeral volume lifecycle mode are the Secrets Store CSI Driver and the Cert-Manager CSI Driver.
The Secrets Store CSI Driver allows users to mount secrets from external secret stores into a pod as an inline volume. This can be useful when the secrets are stored in an external managed service or Vault instance.
The Cert-Manager CSI Driver works along with cert-manager to seamlessly request and mount certificate key pairs into a pod. This allows the certificates to be renewed and updated in the application pod automatically.
Security Considerations
Special consideration should be given to which CSI drivers may be used as inline volumes. volumeAttributes are typically controlled through the StorageClass, and may contain attributes that should remain restricted to the cluster administrator. Allowing a CSI driver to be used for inline ephemeral volumes means that any user with permission to create pods may also provide volumeAttributes to the driver through a pod spec.
Cluster administrators may choose to omit (or remove) Ephemeral from volumeLifecycleModes in the CSIDriver spec to prevent the driver from being used as an inline ephemeral volume, or use an admission webhook to restrict how the driver is used.
Authors: Tim Allclair (Google), Sam Stoelinga (Google)
The release of Kubernetes v1.25 marks a major milestone for Kubernetes out-of-the-box pod security
controls: Pod Security admission (PSA) graduated to stable, and Pod Security Policy (PSP) has been
removed.
PSP was deprecated in Kubernetes v1.21,
and no longer functions in Kubernetes v1.25 and later.
The Pod Security admission controller replaces PodSecurityPolicy, making it easier to enforce predefined
Pod Security Standards by
simply adding a label to a namespace. The Pod Security Standards are maintained by the K8s
community, which means you automatically get updated security policies whenever new
security-impacting Kubernetes features are introduced.
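For example, enforcement is configured entirely through namespace labels; the label keys below are the real Pod Security admission labels, while the namespace name is a placeholder:

```shell
# Sketch: a Namespace that enforces "baseline" and warns about "restricted".
cat > /tmp/ns-pod-security.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace          # placeholder
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
EOF
# Apply with: kubectl apply -f /tmp/ns-pod-security.yaml
```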
What’s new since Beta?
Pod Security Admission hasn’t changed much since the Beta in Kubernetes v1.23. The focus has been on
improving the user experience, while continuing to maintain a high quality bar.
Improved violation messages
We improved violation messages so that you get
fewer duplicate messages. For example,
instead of the following message when the Baseline and Restricted policies check the same
capability:
pods "admin-pod" is forbidden: violates PodSecurity "restricted:latest": non-default capabilities (container "admin" must not include "SYS_ADMIN" in securityContext.capabilities.add), unrestricted capabilities (container "admin" must not include "SYS_ADMIN" in securityContext.capabilities.add)
You get this message:
pods "admin-pod" is forbidden: violates PodSecurity "restricted:latest": unrestricted capabilities (container "admin" must not include "SYS_ADMIN" in securityContext.capabilities.add)
Improved namespace warnings
When you modify the enforce Pod Security labels on a namespace, the Pod Security
admission controller checks all existing pods for
violations and surfaces a warning if any are out of compliance. These
warnings are now aggregated for pods with
identical violations, making large namespaces with many replicas much more manageable.
Additionally, when you apply a non-privileged label to a namespace that has been
configured to be exempt,
you will now get a warning alerting you to this fact:
Warning: namespace 'kube-system' is exempt from Pod Security, and the policy (enforce=baseline:latest) will be ignored
Changes to the Pod Security Standards
The Pod Security Standards,
which Pod Security admission enforces, have been updated with support for the new Pod OS
field. In v1.25 and later, if you use the Restricted policy, the following Linux-specific restrictions will no
longer be required if you explicitly set the pod's .spec.os.name field to windows:
Seccomp - The seccompProfile.type field for Pod and container security contexts
Privilege escalation - The allowPrivilegeEscalation field on container security contexts
Capabilities - The requirement to drop ALL capabilities in the capabilities field on containers
In Kubernetes v1.23 and earlier, the kubelet didn't enforce the Pod OS field.
If your cluster includes nodes running a v1.23 or older kubelet, you should explicitly
pin Restricted policies
to a version prior to v1.25.
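Pinning is done with the enforce-version label, a sketch of which follows (the namespace name is a placeholder; the label key is the real Pod Security admission version label):

```shell
# Sketch: pin the Restricted policy to its pre-v1.25 definition for a namespace.
cat > /tmp/ns-pinned.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace          # placeholder
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.24   # pre-v1.25 policy definition
EOF
# Apply with: kubectl apply -f /tmp/ns-pinned.yaml
```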
Migrating from PodSecurityPolicy to the Pod Security admission controller
For instructions to migrate from PodSecurityPolicy to the Pod Security admission controller, and
for help choosing a migration strategy, refer to the
migration guide.
We're also developing a tool called
pspmigrator to automate parts
of the migration process.
The PodSecurityPolicy (PSP) admission controller has been removed, as of
Kubernetes v1.25. Its deprecation was announced and detailed in the blog post
PodSecurityPolicy Deprecation: Past, Present, and Future,
published for the Kubernetes v1.21 release.
This article aims to provide historical context on the birth and evolution of
PSP, explain why the feature never made it to stable, and show why it was
removed and replaced by Pod Security admission control.
PodSecurityPolicy, like other specialized admission control plugins, provided
fine-grained permissions on specific fields concerning the pod security settings
as a built-in policy API. It acknowledged that cluster administrators and
cluster users are usually not the same people, and that creating workloads in
the form of a Pod or any resource that will create a Pod should not equal being
"root on the cluster". It could also encourage best practices by configuring
more secure defaults through mutation and decoupling low-level Linux security
decisions from the deployment process.
The birth of PodSecurityPolicy
PodSecurityPolicy originated from OpenShift's SecurityContextConstraints
(SCC) that were in the very first release of the Red Hat OpenShift Container Platform,
even before Kubernetes 1.0. PSP was a stripped-down version of the SCC.
The origin of the creation of PodSecurityPolicy is difficult to track, notably
because it was mainly added before the Kubernetes Enhancement Proposal (KEP)
process existed, when design proposals were still a thing. Indeed, the archive of the final
design proposal
is still available. Nevertheless, KEP issue number five
was created after the first pull requests were merged.
Before adding the first piece of code that created PSP, two main pull
requests were merged into Kubernetes, a SecurityContext subresource
that defined new fields on pods' containers, and the first iteration of the ServiceAccount
API.
Kubernetes 1.0 was released on 10 July 2015 without any mechanism to restrict the
security context and sensitive options of workloads, other than an alpha-quality
SecurityContextDeny admission plugin (then known as scdeny).
The SecurityContextDeny plugin
is still in Kubernetes today (as an alpha feature) and creates an admission controller that
prevents the usage of some fields in the security context.
The roots of PodSecurityPolicy were laid with
the very first pull request on security policy,
which added the design proposal for the new PSP object, based on the SCC. The
discussion lasted nine months, with back and forth from OpenShift's SCC,
many rebases, and the rename to PodSecurityPolicy that finally made it into
upstream Kubernetes in February 2016. Now that the PSP object
had been created, the next step was to add an admission controller that could enforce
these policies. The first step was to add the admission logic
without taking users or groups into account.
A specific issue to bring PodSecurityPolicy to a usable state
was created to track progress, and a first version of the admission
controller was merged in a pull request named PSP admission
in May 2016. Around two months later, Kubernetes 1.3 was released.
Here is a timeline that recaps the main pull requests of the birth of the
PodSecurityPolicy and its admission controller with 1.0 and 1.3 releases as
reference points.
After that, the PSP admission controller was enhanced with what had initially been
left aside. The authorization mechanism,
merged in early November 2016, allowed administrators to use multiple policies
in a cluster to grant different levels of access to different types of users.
Later, a pull request
merged in October 2017 fixed a design issue
in PodSecurityPolicy ordering (preferring non-mutating policies before falling
back to alphabetical order), and continued to
build the PSP admission as we know it. After that, many improvements and fixes
followed to build the PodSecurityPolicy feature of recent Kubernetes releases.
The rise of Pod Security Admission
Despite the crucial issue it was trying to solve, PodSecurityPolicy presented
some major flaws:
Flawed authorization model - a user can create a pod if either the user or
the pod's service account has the use permission on a PSP that allows that
pod.
Difficult to roll out - PSPs fail closed: in the absence of a policy,
all pods are denied. This means PSP cannot be enabled by default and
that users must add PSPs for all workloads before enabling the feature,
with no audit mode to discover which pods a new policy would reject.
The opt-in model also led to insufficient test coverage and
frequent breakage due to cross-feature incompatibility. And unlike RBAC,
there was no strong culture of shipping PSP manifests with projects.
Inconsistent, unbounded API - the API grew with lots of
inconsistencies, notably because of many requests for niche use cases: e.g.
labels, scheduling, fine-grained volume controls, etc. It has poor
composability and a weak prioritization model, leading to unexpected
mutation priority. This made it really difficult to combine PSP with other
third-party admission controllers.
Requires security knowledge - effective usage still requires an
understanding of Linux security primitives, e.g. MustRunAsNonRoot +
AllowPrivilegeEscalation.
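The flawed authorization model mentioned above hinged on RBAC's use verb: a PSP only applied if the requesting user or the pod's service account was granted use on it. A minimal sketch of such a binding (policy, role, and namespace names are illustrative):

```yaml
# ClusterRole granting "use" on one specific PSP (name is illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted-user
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  resourceNames: ["restricted"]
  verbs: ["use"]
---
# Bind the role to a service account so that pods it runs
# are evaluated against the "restricted" PSP
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-restricted-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted-user
subjects:
- kind: ServiceAccount
  name: default
  namespace: example
```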
Experience with PodSecurityPolicy showed that most users only need two or three
policies, which led to the creation of the Pod Security Standards,
which define three policies:
Privileged - unrestricted policy.
Baseline - minimally restrictive policy, allowing the default pod
configuration.
Restricted - security best practice policy.
The replacement for PSP, the new Pod Security Admission,
is an in-tree admission plugin, stable as of Kubernetes v1.25, that enforces these
standards at the namespace level. It makes it easier to enforce basic pod
security without deep security knowledge. For more sophisticated use cases, you
might need a third-party solution that can easily be combined with Pod Security
Admission.
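In practice, Pod Security Admission is configured by labeling a namespace with the standard to apply in each mode; a minimal sketch (namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example-app
  labels:
    # Reject pods that violate the Baseline standard
    pod-security.kubernetes.io/enforce: baseline
    # Surface (but do not block) Restricted violations for visibility
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

The warn and audit modes provide the gradual, observable rollout path that PSP never had.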
This release includes a total of 40 enhancements. Fifteen of those enhancements are entering Alpha, ten are graduating to Beta, and thirteen are graduating to Stable. We also have two features being deprecated or removed.
Release theme and logo
Kubernetes 1.25: Combiner
The theme for Kubernetes v1.25 is Combiner.
The Kubernetes project itself is made up of many, many individual components that, when combined, take the form of the project you see today. It is also built and maintained by many individuals, all of them with different skills, experiences, histories, and interests, who join forces not just as the release team but as the many SIGs that support the project and the community year-round.
With this release we wish to honor the collaborative, open spirit that takes us from isolated developers, writers, and users spread around the globe to a combined force capable of changing the world. Kubernetes v1.25 includes a staggering 40 enhancements, none of which would exist without the incredible power we have when we work together.
Inspired by our release lead's son, Albert Song, Kubernetes v1.25 is named for each and every one of you, no matter how you choose to contribute your unique power to the combined force that becomes Kubernetes.
What's New (Major Themes)
PodSecurityPolicy is removed; Pod Security Admission graduates to Stable
PodSecurityPolicy was initially deprecated in v1.21, and with the release of v1.25, it has been removed. The updates required to improve its usability would have introduced breaking changes, so it became necessary to remove it in favor of a more friendly replacement. That replacement is Pod Security Admission, which graduates to Stable with this release. If you are currently relying on PodSecurityPolicy, please follow the instructions for migration to Pod Security Admission.
Ephemeral Containers Graduate to Stable
Ephemeral Containers are containers that exist for only a limited time within an existing pod. This is particularly useful for troubleshooting when you need to examine another container but cannot use kubectl exec because that container has crashed or its image lacks debugging utilities. Ephemeral containers graduated to Beta in Kubernetes v1.23, and with this release, the feature graduates to Stable.
Support for cgroups v2 Graduates to Stable
It has been more than two years since the Linux kernel cgroups v2 API was declared stable. With some distributions now defaulting to this API, Kubernetes must support it to continue operating on those distributions. cgroups v2 offers several improvements over cgroups v1; for more information, see the cgroups v2 documentation. While cgroups v1 will continue to be supported, this enhancement puts us in a position to be ready for its eventual deprecation and replacement.
Promoted endPort in Network Policy to GA. Network Policy providers that support the endPort field can now use it to specify a range of ports in a Network Policy. Previously, each Network Policy rule could only target a single port.
Please be aware that the endPort field must be supported by the Network Policy provider. If your provider does not support endPort and this field is specified in a Network Policy, the Network Policy will be created covering only the port field (a single port).
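The endPort field extends a single port entry into an inclusive range; a minimal sketch of a policy allowing egress to a port range (names and CIDR are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-port-range
  namespace: example
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 32000     # start of the range
      endPort: 32768  # inclusive end; requires provider support
```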
Promoted Local Ephemeral Storage Capacity Isolation to Stable
The Local Ephemeral Storage Capacity Isolation feature moved to GA. This was introduced as alpha in 1.8, moved to beta in 1.10, and is now a stable feature. It provides support for capacity isolation of local ephemeral storage between pods, such as emptyDir volumes, so that a pod can be hard-limited in its consumption of shared resources; a pod is evicted if its local ephemeral storage usage exceeds that limit.
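Capacity isolation is expressed through the ephemeral-storage resource on containers; a minimal sketch (image and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-storage-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.8
    resources:
      requests:
        ephemeral-storage: "1Gi"
      limits:
        # Exceeding this limit makes the pod subject to eviction
        ephemeral-storage: "2Gi"
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}
```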
Promoted core CSI Migration to Stable
CSI Migration is an ongoing effort that SIG Storage has been working on for a few releases. The goal is to move in-tree volume plugins to out-of-tree CSI drivers and eventually remove the in-tree volume plugins. The core CSI Migration feature moved to GA. CSI Migration for GCE PD and AWS EBS also moved to GA. CSI Migration for vSphere remains in beta (but is on by default). CSI Migration for Portworx moved to Beta (but is off by default).
Promoted CSI Ephemeral Volume to Stable
The CSI Ephemeral Volume feature allows CSI volumes to be specified directly in the pod specification for ephemeral use cases. They can be used to inject arbitrary states, such as configuration, secrets, identity, variables or similar information, directly inside pods using a mounted volume. This was initially introduced in 1.15 as an alpha feature, and it moved to GA. This feature is used by some CSI drivers such as the secret-store CSI driver.
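An ephemeral CSI volume is declared inline in the pod specification rather than through a PersistentVolumeClaim; a minimal sketch (the driver name and volume attributes are illustrative and entirely driver-specific):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: csi-ephemeral-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.8
    volumeMounts:
    - name: inline-vol
      mountPath: /data
      readOnly: true
  volumes:
  - name: inline-vol
    csi:
      driver: inline.storage.example.com  # hypothetical CSI driver name
      volumeAttributes:
        exampleAttribute: exampleValue    # interpretation is driver-specific
```

The volume's lifecycle is tied to the pod: it is created when the pod is scheduled and destroyed when the pod is deleted.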
Promoted CRD Validation Expression Language to Beta
Promoted Server Side Unknown Field Validation to Beta
Promoted the ServerSideFieldValidation feature gate to beta (on by default). This allows optionally triggering schema validation on the API server that errors when unknown fields are detected. This allows the removal of client-side validation from kubectl while maintaining the same core functionality of erroring out on requests that contain unknown or invalid fields.
Introduced KMS v2 API
Introduced the KMS v2alpha1 API to add performance, rotation, and observability improvements. Data at rest (i.e. Kubernetes Secrets) is encrypted with a DEK using AES-GCM instead of AES-CBC for KMS data encryption. No user action is required. Reads with AES-GCM and AES-CBC will continue to be allowed. See the guide Using a KMS provider for data encryption for more information.
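KMS v2 is opted into in the API server's EncryptionConfiguration by setting apiVersion: v2 on the kms provider entry; a sketch, assuming a KMS plugin listening on a local Unix socket (plugin name and socket path are illustrative):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - kms:
      apiVersion: v2              # opt in to the KMS v2alpha1 API
      name: example-kms-plugin    # illustrative plugin name
      endpoint: unix:///var/run/kms/example.sock
      timeout: 3s
  - identity: {}                  # fallback for reading still-unencrypted data
```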
Other Updates
Graduations to Stable
This release includes a total of thirteen enhancements promoted to stable:
The complete details of the Kubernetes v1.25 release are available in our release notes.
Availability
Kubernetes v1.25 is available for download on GitHub.
To get started with Kubernetes, check out these interactive tutorials or run local
Kubernetes clusters using containers as “nodes”, with kind.
You can also easily install 1.25 using kubeadm.
Release Team
Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that, when combined, make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.
We would like to thank the entire release team for the hours spent hard at work to ensure we deliver a solid Kubernetes v1.25 release for our community. Every one of you had a part to play in building this, and you all executed beautifully. We would like to extend special thanks to our fearless release lead, Cici Huang, for all she did to guarantee we had what we needed to succeed.
KubeCon + CloudNativeCon North America 2022 will take place in Detroit, Michigan from 24 – 28 October 2022! You can find more information about the conference and registration on the event site.
KubeDay event series kicks off with KubeDay Japan on December 7! Register or submit a proposal on the event site.
The CNCF K8s DevStats project
aggregates a number of interesting data points related to the velocity of Kubernetes and various
sub-projects. This includes everything from individual contributions to the number of companies that
are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.
Join members of the Kubernetes v1.25 release team on Thursday September 22, 2022 10am – 11am PT to learn about
the major features of this release, as well as deprecations and removals to help plan for upgrades.
For more information and registration, visit the event page.
Get Involved
The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests.
Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below:
Since the very beginning of Kubernetes, the topic of persistent data and how to address the requirement of stateful applications has been an important topic. Support for stateless deployments was natural, present from the start, and garnered attention, becoming very well-known. Work on better support for stateful applications was also present from early on, with each release increasing the scope of what could be run on Kubernetes.
Message queues, databases, clustered filesystems: these are some examples of the solutions that have different storage requirements and that are, today, increasingly deployed in Kubernetes. Dealing with ephemeral and persistent storage, local or remote, file or block, from many different vendors, while considering how to provide the needed resiliency and data consistency that users expect, all of this is under SIG Storage's umbrella.
In this SIG Storage spotlight, Frederico Muñoz (Cloud & Architecture Lead at SAS) talked with Xing Yang, Tech Lead at VMware and co-chair of SIG Storage, on how the SIG is organized, what are the current challenges and how anyone can get involved and contribute.
About SIG Storage
Frederico (FSM): Hello, thank you for the opportunity of learning more about SIG Storage. Could you tell us a bit about yourself, your role, and how you got involved in SIG Storage.
Xing Yang (XY): I am a Tech Lead at VMware, working on Cloud Native Storage. I am also a Co-Chair of SIG Storage. I started to get involved in K8s SIG Storage at the end of 2017, starting with contributing to the VolumeSnapshot project. At that time, the VolumeSnapshot project was still in an experimental, pre-alpha stage. It needed contributors. So I volunteered to help. Then I worked with other community members to bring VolumeSnapshot to Alpha in K8s 1.12 release in 2018, Beta in K8s 1.17 in 2019, and eventually GA in 1.20 in 2020.
FSM: Reading the SIG Storage charter alone it’s clear that SIG Storage covers a lot of ground, could you describe how the SIG is organised?
XY: In SIG Storage, there are two Co-Chairs and two Tech Leads. Saad Ali from Google and myself are Co-Chairs. Michelle Au from Google and Jan Šafránek from Red Hat are Tech Leads.
We have bi-weekly meetings where we go through features we are working on for each particular release, getting the statuses, making sure each feature has dev owners and reviewers working on it, and reminding people about the release deadlines, etc. More information on the SIG is on the community page. People can also add PRs that need attention, design proposals that need discussion, and other topics to the meeting agenda doc. We will go over them after project tracking is done.
We also have other regular meetings, i.e., CSI Implementation meeting, Object Bucket API design meeting, and one-off meetings for specific topics if needed. There is also a K8s Data Protection Workgroup that is sponsored by SIG Storage and SIG Apps. SIG Storage owns or co-owns features that are being discussed at the Data Protection WG.
Storage and Kubernetes
FSM: Storage is such a foundational component in so many things, not least in Kubernetes: what do you think are the Kubernetes-specific challenges in terms of storage management?
XY: In Kubernetes, there are multiple components involved for a volume operation. For example, creating a Pod to use a PVC has multiple components involved. There are the Attach Detach Controller and the external-attacher working on attaching the PVC to the pod. There’s the Kubelet that works on mounting the PVC to the pod. Of course the CSI driver is involved as well. There could be race conditions sometimes when coordinating between multiple components.
Another challenge is regarding core vs Custom Resource Definitions (CRD), not really storage specific. CRD is a great way to extend Kubernetes capabilities while not adding too much code to the Kubernetes core itself. However, this also means there are many external components that are needed when running a Kubernetes cluster.
From the SIG Storage side, the most notable example is Volume Snapshot. Volume Snapshot APIs are defined as CRDs. API definitions and controllers are out-of-tree. There is a common snapshot controller and a snapshot validation webhook that should be deployed on the control plane, similar to how kube-controller-manager is deployed. Although Volume Snapshot is a CRD, it is a core feature of SIG Storage. It is recommended that K8s cluster distros deploy the Volume Snapshot CRDs, the snapshot controller, and the snapshot validation webhook; however, most of the time we don't see distros deploy them. So this becomes a problem for the storage vendors: now it becomes their responsibility to deploy these non-driver-specific common components. This could cause conflicts if a customer wants to use more than one storage system and deploy more than one CSI driver.
FSM: Not only the complexity of a single storage system, you have to consider how they will be used together in Kubernetes?
XY: Yes, there are many different storage systems that can provide storage to containers in Kubernetes. They don’t work the same way. It is challenging to find a solution that works for everyone.
FSM: Storage in Kubernetes also involves interacting with external solutions, perhaps more so than other parts of Kubernetes. Is this interaction with vendors and external providers challenging? Has it evolved with time in any way?
XY: Yes, it is definitely challenging. Initially Kubernetes storage had in-tree volume plugin interfaces. Multiple storage vendors implemented these in-tree interfaces and have volume plugins in the Kubernetes core code base. This caused lots of problems. If there is a bug in a volume plugin, it affects the entire Kubernetes code base. All volume plugins must be released together with Kubernetes. There was no flexibility if storage vendors needed to fix a bug in their plugin or wanted to align with their own product releases.
FSM: That’s where CSI enters the game?
XY: Exactly, that's where the Container Storage Interface (CSI) comes in. This is an industry standard that defines common storage interfaces so that a storage vendor can write one plugin and have it work across a range of container orchestration systems (COs). Now Kubernetes is the main CO, but when CSI first started, there were Docker, Mesos, and Cloud Foundry in addition to Kubernetes. CSI drivers are out-of-tree, so bug fixes and releases can happen at their own pace.
CSI is definitely a big improvement compared to in-tree volume plugins. The Kubernetes implementation of CSI has been GA since the 1.13 release. It has come a long way. SIG Storage has been working on moving in-tree volume plugins to out-of-tree CSI drivers for several releases now.
FSM: Moving drivers away from the Kubernetes main tree and into CSI was an important improvement.
XY: The CSI interface is an improvement over the in-tree volume plugin interface; however, there are still challenges. There are lots of storage systems. Currently there are more than 100 CSI drivers listed in the CSI driver docs. These storage systems are also very diverse, so it is difficult to design a common API that works for all. We introduced capabilities at the CSI driver level, but we also face challenges when volumes provisioned by the same driver have different behaviors. The other day we had a meeting discussing Per Volume CSI Driver Capabilities: we have a problem differentiating some CSI driver capabilities when the same driver supports both block and file volumes. We are going to have follow-up meetings to discuss this problem.
Ongoing challenges
FSM: Specifically for the 1.25 release we can see that there are a relevant number of storage-related KEPs in the pipeline, would you say that this release is particularly important for the SIG?
XY: I wouldn’t say one release is more important than other releases. In any given release, we are working on a few very important things.
FSM: Indeed, but are there any 1.25-specific highlights you would like to point out?
XY: Yes. For the 1.25 release, I want to highlight the following:
CSI Migration is an ongoing effort that SIG Storage has been working on for a few releases now. The goal is to move in-tree volume plugins to out-of-tree CSI drivers and eventually remove the in-tree volume plugins. Seven of the KEPs we are targeting in 1.25 are related to CSI migration. There is one core KEP for the general CSI Migration feature, which is targeting GA in 1.25. CSI Migration for GCE PD and AWS EBS are targeting GA. CSI Migration for vSphere is targeting having its feature gate on by default while remaining in Beta in 1.25. Ceph RBD and Portworx are targeting Beta, with the feature gate off by default. Ceph FS is targeting Alpha.
The second one I want to highlight is COSI, the Container Object Storage Interface. This is a sub-project under SIG Storage. COSI proposes object storage Kubernetes APIs to support orchestration of object store operations for Kubernetes workloads. It also introduces gRPC interfaces for object storage providers to write drivers to provision buckets. The COSI team has been working on this project for more than two years now. The COSI feature is targeting Alpha in 1.25. The KEP just got merged. The COSI team is working on updating the implementation based on the updated KEP.
Another feature I want to mention is CSI Ephemeral Volume support. This feature allows CSI volumes to be specified directly in the pod specification for ephemeral use cases. They can be used to inject arbitrary states, such as configuration, secrets, identity, variables or similar information, directly inside pods using a mounted volume. This was initially introduced in 1.15 as an alpha feature, and it is now targeting GA in 1.25.
FSM: If you had to single something out, what would be the most pressing areas the SIG is working on?
XY: CSI migration is definitely one area that the SIG has put in lots of effort and it has been on-going for multiple releases now. It involves work from multiple cloud providers and storage vendors as well.
Community involvement
FSM: Kubernetes is a community-driven project. Any recommendation for anyone looking into getting involved in SIG Storage work? Where should they start?
XY: Take a look at the SIG Storage community page, it has lots of information on how to get started. There are SIG annual reports that tell you what we did each year. Take a look at the Contributing guide. It has links to presentations that can help you get familiar with Kubernetes storage concepts.
Join our bi-weekly meetings on Thursdays. Learn how the SIG operates and what we are working on for each release. Find a project that you are interested in and help out. As I mentioned earlier, I got started in SIG Storage by contributing to the Volume Snapshot project.
FSM: Any closing thoughts you would like to add?
XY: SIG Storage always welcomes new contributors. We need contributors to help with building new features, fixing bugs, doing code reviews, writing tests, monitoring test grid health, improving documentation, and more.
FSM: Thank you so much for your time and insights into the workings of SIG Storage!