DevOps & SRE notes
12.2K subscribers
45 photos
19 files
2.52K links
Helpful articles and tools for DevOps & SRE

WhatsApp: https://whatsapp.com/channel/0029Vb79nmmHVvTUnc4tfp2F

For paid consultation (RU/EN), contact: @tutunak


All ways to support https://telegra.ph/How-support-the-channel-02-19
If you, like me, use linters in the pipeline for GitOps repositories, this repo is the best thing you can use. It contains popular Kubernetes CRDs (CustomResourceDefinition) in JSON schema format.

https://github.com/datreeio/CRDs-catalog
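For example, with kubeconform (one popular manifest linter) you can point a second schema location at this catalog so custom resources get validated alongside core objects. The flag template below follows the catalog's README; double-check it against your kubeconform version:

```shell
# Validate manifests, pulling CRD schemas from the datreeio catalog.
# kubeconform expands {{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}
# for each custom resource it encounters.
kubeconform \
  -schema-location default \
  -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
  -summary \
  manifests/
```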
👍3🔥31
The article clarifies the distinction between Platform Engineering (focused on velocity and Developer Experience/DevEx) and Site Reliability Engineering (focused on stability and production health). It argues that while their daily tasks differ, they must be integrated: Platform Engineers build the "golden paths" that abstract infrastructure complexity, while SREs ensure those paths are robust, scalable, and monitored.

https://octopus.com/devops/platform-engineering/platform-engineering-vs-sre/
👍2
Today I read the article “What Would a Kubernetes 2.0 Look Like?”, with thoughts on what the next major version might be. And found this :)

“YAML is just too much for what we're trying to do with k8s and it's not a safe enough format. Indentation is error-prone, the files don't scale great (you really don't want a super long YAML file), debugging can be annoying. YAML has so many subtle behaviors outlined in its spec.”


“HCL is already the format for Terraform, so at least we'd only have to hate one configuration language instead of two. It's strongly typed with explicit types. There's already good validation mechanisms. It is specifically designed to do the job that we are asking YAML to do and it's not much harder to read.”


and realized that the Kubernetes developers had the same thoughts about YAML, but instead of adopting HCL, they just invented their own HCL-like dialect: KYAML.
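The “subtle behaviors” complaint is easy to demonstrate. Under the YAML 1.1 rules that much Kubernetes tooling still follows, unquoted scalars get surprising types (the classic “Norway problem”):

```yaml
# YAML 1.1 type coercion surprises (values as many parsers see them):
country: NO        # parsed as boolean false, not the string "NO"
debug: off         # boolean false
version: 1.20      # float 1.2 -- the trailing zero is gone
port: 0x50         # integer 80
# The fix is to quote anything that must stay a string:
country_quoted: "NO"
```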
🤣14
Recently I needed a simple way to notify developers about changes to ConfigMaps. To my surprise, there is only one straightforward tool that does exactly this one thing, and that is Kubewatch. So if you want something simple for notifying about changes to objects in your K8s cluster, choose Kubewatch.

https://github.com/robusta-dev/kubewatch
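A minimal config sketch for the ConfigMap case (key names are from the project README as I remember them, so verify against the current docs before use). Kubewatch reads `.kubewatch.yaml` and sends events for the enabled resource types to the configured handler:

```yaml
# .kubewatch.yaml -- watch ConfigMaps and post changes to Slack.
handler:
  slack:
    token: xoxb-your-bot-token   # placeholder
    channel: "#k8s-events"
resource:
  configmap: true    # the one thing we care about here
  deployment: false
  pod: false
namespace: ""        # empty = all namespaces
```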
👍32
Anyone who has been on call at night knows that it's nearly impossible to react within minutes and triage an incident fast enough, especially if you face such situations rarely. When you are paged once a quarter or once a year, your dashboards are outdated, your diagnostic skills are rusty, and the system has changed a great deal since you last looked at it. In those cases, a current AI agent can be useful. Judging by this article, by the time you get paged, wake up, turn on your laptop, and try to open your eyes, the agent can already have triaged the incident and produced a full report with recommendations. Yes, we still need a human to approve those changes or apply them manually (as with planes: people prefer to see a live pilot, even though autopilots are already better than humans).

https://www.opsworker.ai/blog/agent-driven-sre-investigations-a-practical-deep-dive-into-multi-agent-incident-response/
👍5🔥2🎉1
The article "What happens inside the Kubernetes API server?" has been updated. It is a good starting point for preparing for your next K8s job interview.

https://learnkube.com/kubernetes-api-explained
👍5
Sometimes finding a good solution for backups is a difficult task, but for many years one of the main tools I’ve used for backing up my workstation is Restic. I’ve used it on Linux, macOS, and Windows, and it works perfectly — delivering backups to HDDs and Backblaze. I can recommend it to everyone: it’s quite fast, reliable, and an optimal solution for most file-backup cases.


https://github.com/restic/restic
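The everyday workflow is just a handful of commands (local-repository example; commands per the restic docs, adjust the repo path to taste):

```shell
export RESTIC_REPOSITORY=/mnt/hdd/restic-repo
export RESTIC_PASSWORD_FILE=~/.restic-pass

restic init                          # one-time repository setup
restic backup ~/Documents ~/work     # incremental, deduplicated backup
restic snapshots                     # list what you have
restic check                         # verify repository integrity
restic restore latest --target /tmp/restore
```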

What backup solution do you use? Share it in the comments 👇
👍7💯1
The article details Salesforce’s transition from the traditional AWS Cluster Autoscaler (based on Auto Scaling Groups) to Karpenter. To manage this at a massive scale, Salesforce built custom automation tools to handle non-disruptive migrations, mapped over 1,180 diverse node pool configurations, and implemented a phased rollout that reduced operational overhead by 80% and improved scaling speed from minutes to seconds.

https://aws.amazon.com/blogs/architecture/how-salesforce-migrated-from-cluster-autoscaler-to-karpenter-across-their-fleet-of-1000-eks-clusters/
👍4
Finally, Grafana has addressed the elephant in the room.

Let's be honest, the previous "Grafana as Code" management was terrible. Whether it was the clunky provisioning system or the need for endless sidecars and scripts, it always felt like a hack.

They have now introduced Grafana Git Sync. You can connect a repository directly to Grafana, and it natively syncs your dashboards and data sources from Git. No more API workarounds or messy provisioning files.

It looks like the GitOps workflow for observability might finally become usable. It’s about time.

https://grafana.com/blog/git-sync-grafana/
👍15
A small reminder: Ingress NGINX will be retired soon (in less than two weeks), so it's a good time to switch to the Gateway API instead.
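For anyone who hasn't looked at it yet, the Gateway API equivalent of a basic Ingress rule is a Gateway plus an HTTPRoute. A minimal sketch (names and the parent Gateway are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
spec:
  parentRefs:
    - name: my-gateway        # an existing Gateway in the cluster
  hostnames:
    - app.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: app-svc
          port: 80
```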
What is the difference between a good and a bad commit message? A good commit message tells you why the change was made; a bad one tells you nothing, or merely restates the changes in the code. This commit explains nothing and provides zero motivation: it is just a fact with no reasoning attached.
https://github.com/kubernetes/kubernetes/commit/94f7f922054d0aa4aa07d572a940ec0dda842646#diff-b2ad44e189798d2d03c3b05e0334899474353de68e03d71653b69ea5fd807c87L287-L387
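As a contrast, an entirely made-up example (the subject line, ticket number, and incident are invented) of the difference:

```text
# Bad: restates the diff
Update config handling

# Good: explains the why
Reject configs with duplicate keys at load time

Silently taking the last value hid typos in production (see
incident INC-123) and made rollbacks non-deterministic.
```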
👍4💯32
The original article is behind a paywall.

TL;DR: an Amazon service was taken down by an AI coding bot

Amazon’s cloud unit has suffered at least two outages due to errors involving its own AI tools, leading some employees to raise doubts about the US tech giant’s push to roll out these coding assistants.

Amazon Web Services experienced a 13-hour interruption to one system used by its customers in mid-December after engineers allowed its Kiro AI coding tool to make certain changes, according to four people familiar with the matter.

The people said the agentic tool, which can take autonomous actions on behalf of users, determined that the best course of action was to “delete and recreate the environment”.

Amazon posted an internal postmortem about the “outage” of the AWS system, which lets customers explore the costs of its services.

Multiple Amazon employees told the FT that this was the second occasion in recent months in which one of the group’s AI tools had been at the centre of a service disruption.

“We’ve already seen at least two production outages [in the past few months],” said one senior AWS employee. “The engineers let the AI [agent] resolve an issue without intervention. The outages were small but entirely foreseeable.”

AWS, which accounts for 60 per cent of Amazon’s operating profits, is seeking to build and deploy AI tools including “agents” capable of taking actions independently based on human instructions.

Like many Big Tech companies, it is seeking to sell this technology to outside customers. The incidents highlight the risk that these nascent AI tools can misbehave and cause disruptions.

Amazon said it was a “coincidence that AI tools were involved” and that “the same issue could occur with any developer tool or manual action”.

“In both instances, this was user error, not AI error,” Amazon said, adding that it had not seen evidence that mistakes were more common with AI tools.

The company said the incident in December was an “extremely limited event” affecting only a single service in parts of mainland China. Amazon added that the second incident did not have an impact on a “customer facing AWS service”.

Neither disruption was anywhere near as severe as a 15-hour AWS outage in October 2025 that forced multiple customers’ apps and websites offline — including OpenAI’s ChatGPT.

Employees said the group’s AI tools were treated as an extension of an operator and given the same permissions. In these two cases, the engineers involved did not require a second person’s approval before making changes, as would normally be the case.

Amazon said that by default its Kiro tool “requests authorisation before taking any action” but said the engineer involved in the December incident had “broader permissions than expected — a user access control issue, not an AI autonomy issue”.

AWS launched Kiro in July. It said the coding assistant would advance beyond “vibe coding” — which allows users to quickly build applications — to instead write code based on a set of specifications.

The group had earlier relied on its Amazon Q Developer product, an AI-enabled chatbot, to help engineers write code. This was involved in the earlier outage, three of the employees said.

Some Amazon employees said they were still sceptical of AI tools’ utility for the bulk of their work given the risk of error. They added that the company had set a target for 80 per cent of developers to use AI for coding tasks at least once a week and was closely tracking adoption.

Amazon said it was experiencing strong customer growth for Kiro and that it wanted customers and employees to benefit from efficiency gains.

“Following the December incident, AWS implemented numerous safeguards”, including mandatory peer review and staff training, Amazon added.

src: https://www.ft.com/content/00c282de-ed14-4acd-a948-bc8d6bdb339d
👏2😱21🔥1
AWS Cost Optimization Game Day — a hands-on, interactive session focused on improving cloud efficiency and reducing costs in real-world scenarios.

You’ll collaborate, analyze architectures, uncover cost-saving opportunities, and compete in a fun, gamified environment.

Ready to optimize and win?
Let’s play smart with AWS!

When: Wednesday, Mar 11 · 4:30 PM to 7:30 PM GMT+2
Language: English

Registration link is here
🔥42👍1
Understanding how many pods your infrastructure can actually support is crucial for reliability. This overview breaks down the nuances of Kubernetes cluster capacity and resource allocation.
https://dnastacio.medium.com/kubernetes-cluster-capacity-d96d0d82b380
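As a back-of-the-envelope illustration of the same idea (all numbers below are invented; real allocatable values come from `kubectl describe node`): a node's schedulable pod count is the minimum over the CPU, memory, and max-pods constraints, after subtracting fixed per-node overhead such as DaemonSets:

```python
def schedulable_pods(alloc_cpu_m, alloc_mem_mi, max_pods,
                     ds_cpu_m, ds_mem_mi, ds_pods,
                     pod_cpu_m, pod_mem_mi):
    """Rough per-node pod capacity: min of CPU, memory, and pod-count limits.

    CPU is in millicores, memory in MiB; ds_* is DaemonSet overhead,
    pod_* is the per-pod resource request.
    """
    cpu_fit = (alloc_cpu_m - ds_cpu_m) // pod_cpu_m
    mem_fit = (alloc_mem_mi - ds_mem_mi) // pod_mem_mi
    return max(0, min(cpu_fit, mem_fit, max_pods - ds_pods))

# A 4-vCPU / 16 GiB node with the default 110-pod kubelet limit,
# two DaemonSet pods, and app pods requesting 250m CPU / 512 Mi:
n = schedulable_pods(alloc_cpu_m=3920, alloc_mem_mi=14500, max_pods=110,
                     ds_cpu_m=200, ds_mem_mi=300, ds_pods=2,
                     pod_cpu_m=250, pod_mem_mi=512)
print(n)  # prints 14 -- CPU is the binding constraint here
```

Note that the binding constraint is often not what you expect: here memory would fit 27 pods and the kubelet would allow 108, but CPU requests cap the node at 14.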
👍2