DevOps & SRE notes
12.2K subscribers
45 photos
19 files
2.52K links
Helpful articles and tools for DevOps & SRE

WhatsApp: https://whatsapp.com/channel/0029Vb79nmmHVvTUnc4tfp2F

For paid consultation (RU/EN), contact: @tutunak


All ways to support https://telegra.ph/How-support-the-channel-02-19
If you, like me, use linters in the pipeline for GitOps repositories, this repo is the best thing you can use. It contains popular Kubernetes CRDs (CustomResourceDefinition) in JSON schema format.

https://github.com/datreeio/CRDs-catalog
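For example, with kubeconform (one popular manifest linter) you can point a second schema location at this catalog so custom resources get validated alongside core objects. The flag template below follows the catalog's README; double-check it against your kubeconform version:

```shell
# Validate manifests, pulling CRD schemas from the datreeio catalog.
# kubeconform expands {{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}
# for each custom resource it encounters.
kubeconform \
  -schema-location default \
  -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
  -summary \
  manifests/
```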
👍3🔥31
The article clarifies the distinction between Platform Engineering (focused on velocity and Developer Experience/DevEx) and Site Reliability Engineering (focused on stability and production health). It argues that while their daily tasks differ, they must be integrated: Platform Engineers build the "golden paths" that abstract infrastructure complexity, while SREs ensure those paths are robust, scalable, and monitored.

https://octopus.com/devops/platform-engineering/platform-engineering-vs-sre/
👍2
Today I read the article “What Would a Kubernetes 2.0 Look Like?”, with thoughts on what the next major version might be. And found this :)

“YAML is just too much for what we're trying to do with k8s and it's not a safe enough format. Indentation is error-prone, the files don't scale great (you really don't want a super long YAML file), debugging can be annoying. YAML has so many subtle behaviors outlined in its spec.”


“HCL is already the format for Terraform, so at least we'd only have to hate one configuration language instead of two. It's strongly typed with explicit types. There's already good validation mechanisms. It is specifically designed to do the job that we are asking YAML to do and it's not much harder to read.”


and realized that the Kubernetes developers had the same thoughts about YAML, but instead of adopting HCL, they just invented their own HCL-like dialect: KYAML.
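The “subtle behaviors” complaint is easy to demonstrate. Under the YAML 1.1 rules that much Kubernetes tooling still follows, unquoted scalars get surprising types (the classic “Norway problem”):

```yaml
# YAML 1.1 type coercion surprises (values as many parsers see them):
country: NO        # parsed as boolean false, not the string "NO"
debug: off         # boolean false
version: 1.20      # float 1.2 -- the trailing zero is gone
port: 0x50         # integer 80
# The fix is to quote anything that must stay a string:
country_quoted: "NO"
```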
🤣14
Recently I needed a simple way to notify developers about changes to ConfigMaps. To my surprise, there is only one straightforward tool that does exactly this one thing, and that is Kubewatch. So if you want something simple for notifying about changes to objects in your K8s cluster, choose Kubewatch.

https://github.com/robusta-dev/kubewatch
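A minimal config sketch for the ConfigMap case (key names are from the project README as I remember them, so verify against the current docs before use). Kubewatch reads `.kubewatch.yaml` and sends events for the enabled resource types to the configured handler:

```yaml
# .kubewatch.yaml -- watch ConfigMaps and post changes to Slack.
handler:
  slack:
    token: xoxb-your-bot-token   # placeholder
    channel: "#k8s-events"
resource:
  configmap: true    # the one thing we care about here
  deployment: false
  pod: false
namespace: ""        # empty = all namespaces
```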
👍32
Anyone who has been on call at night knows that it's nearly impossible to react within minutes and triage an incident fast enough, especially if you face such situations rarely. When you are paged once a quarter or once a year, your dashboards are outdated, your diagnostic skills are rusty, and the system has changed a great deal since you last looked at it. In those cases, a current AI agent can be useful. Judging by this article, by the time you get paged, wake up, turn on your laptop, and try to open your eyes, the agent can already have triaged the incident and produced a full report with recommendations. Yes, we still need a human to approve those changes or apply them manually (as with planes: people prefer to see a live pilot, even though autopilots are already better than humans).

https://www.opsworker.ai/blog/agent-driven-sre-investigations-a-practical-deep-dive-into-multi-agent-incident-response/
👍5🔥2🎉1
The article "What happens inside the Kubernetes API server?" has been updated. It is a good starting point for preparing for your next K8s job interview.

https://learnkube.com/kubernetes-api-explained
👍5
Sometimes finding a good solution for backups is a difficult task, but for many years one of the main tools I’ve used for backing up my workstation is Restic. I’ve used it on Linux, macOS, and Windows, and it works perfectly — delivering backups to HDDs and Backblaze. I can recommend it to everyone: it’s quite fast, reliable, and an optimal solution for most file-backup cases.


https://github.com/restic/restic
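The everyday workflow is just a handful of commands (local-repository example; commands per the restic docs, adjust the repo path to taste):

```shell
export RESTIC_REPOSITORY=/mnt/hdd/restic-repo
export RESTIC_PASSWORD_FILE=~/.restic-pass

restic init                          # one-time repository setup
restic backup ~/Documents ~/work     # incremental, deduplicated backup
restic snapshots                     # list what you have
restic check                         # verify repository integrity
restic restore latest --target /tmp/restore
```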

What backup solution do you use? Share it in the comments 👇
👍7💯1
The article details Salesforce’s transition from the traditional AWS Cluster Autoscaler (based on Auto Scaling Groups) to Karpenter. To manage this at a massive scale, Salesforce built custom automation tools to handle non-disruptive migrations, mapped over 1,180 diverse node pool configurations, and implemented a phased rollout that reduced operational overhead by 80% and improved scaling speed from minutes to seconds.

https://aws.amazon.com/blogs/architecture/how-salesforce-migrated-from-cluster-autoscaler-to-karpenter-across-their-fleet-of-1000-eks-clusters/
👍4
Finally, Grafana has addressed the elephant in the room.

Let's be honest, the previous "Grafana as Code" management was terrible. Whether it was the clunky provisioning system or the need for endless sidecars and scripts, it always felt like a hack.

They have now introduced Grafana Git Sync. You can connect a repository directly to Grafana, and it natively syncs your dashboards and data sources from Git. No more API workarounds or messy provisioning files.

It looks like the GitOps workflow for observability might finally become usable. It’s about time.

https://grafana.com/blog/git-sync-grafana/
👍15
A small reminder: Ingress NGINX will be retired soon (in less than two weeks), so it's a good time to switch to the Gateway API instead.
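For anyone who hasn't looked at it yet, the Gateway API equivalent of a basic Ingress rule is a Gateway plus an HTTPRoute. A minimal sketch (names and the parent Gateway are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
spec:
  parentRefs:
    - name: my-gateway        # an existing Gateway in the cluster
  hostnames:
    - app.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: app-svc
          port: 80
```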
What is the difference between a good and a bad commit message? A good commit message tells you why the change was made; a bad one tells you nothing, or merely restates the changes in the code. This commit explains nothing and provides zero motivation: it is just a fact with no reasoning attached.
https://github.com/kubernetes/kubernetes/commit/94f7f922054d0aa4aa07d572a940ec0dda842646#diff-b2ad44e189798d2d03c3b05e0334899474353de68e03d71653b69ea5fd807c87L287-L387
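As a contrast, an entirely made-up example (the subject line, ticket number, and incident are invented) of the difference:

```text
# Bad: restates the diff
Update config handling

# Good: explains the why
Reject configs with duplicate keys at load time

Silently taking the last value hid typos in production (see
incident INC-123) and made rollbacks non-deterministic.
```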
👍4💯32
The original article is behind a paywall.

TL;DR: an Amazon service was taken down by an AI coding bot

Amazon’s cloud unit has suffered at least two outages due to errors involving its own AI tools, leading some employees to raise doubts about the US tech giant’s push to roll out these coding assistants.

Amazon Web Services experienced a 13-hour interruption to one system used by its customers in mid-December after engineers allowed its Kiro AI coding tool to make certain changes, according to four people familiar with the matter.

The people said the agentic tool, which can take autonomous actions on behalf of users, determined that the best course of action was to “delete and recreate the environment”.

Amazon posted an internal postmortem about the “outage” of the AWS system, which lets customers explore the costs of its services.

Multiple Amazon employees told the FT that this was the second occasion in recent months in which one of the group’s AI tools had been at the centre of a service disruption.

“We’ve already seen at least two production outages [in the past few months],” said one senior AWS employee. “The engineers let the AI [agent] resolve an issue without intervention. The outages were small but entirely foreseeable.”

AWS, which accounts for 60 per cent of Amazon’s operating profits, is seeking to build and deploy AI tools including “agents” capable of taking actions independently based on human instructions.

Like many Big Tech companies, it is seeking to sell this technology to outside customers. The incidents highlight the risk that these nascent AI tools can misbehave and cause disruptions.

Amazon said it was a “coincidence that AI tools were involved” and that “the same issue could occur with any developer tool or manual action”.

“In both instances, this was user error, not AI error,” Amazon said, adding that it had not seen evidence that mistakes were more common with AI tools.

The company said the incident in December was an “extremely limited event” affecting only a single service in parts of mainland China. Amazon added that the second incident did not have an impact on a “customer facing AWS service”.

Neither disruption was anywhere near as severe as a 15-hour AWS outage in October 2025 that forced multiple customers’ apps and websites offline — including OpenAI’s ChatGPT.

Employees said the group’s AI tools were treated as an extension of an operator and given the same permissions. In these two cases, the engineers involved did not require a second person’s approval before making changes, as would normally be the case.

Amazon said that by default its Kiro tool “requests authorisation before taking any action” but said the engineer involved in the December incident had “broader permissions than expected — a user access control issue, not an AI autonomy issue”.

AWS launched Kiro in July. It said the coding assistant would advance beyond “vibe coding” — which allows users to quickly build applications — to instead write code based on a set of specifications.

The group had earlier relied on its Amazon Q Developer product, an AI-enabled chatbot, to help engineers write code. This was involved in the earlier outage, three of the employees said.

Some Amazon employees said they were still sceptical of AI tools’ utility for the bulk of their work given the risk of error. They added that the company had set a target for 80 per cent of developers to use AI for coding tasks at least once a week and was closely tracking adoption.

Amazon said it was experiencing strong customer growth for Kiro and that it wanted customers and employees to benefit from efficiency gains.

“Following the December incident, AWS implemented numerous safeguards”, including mandatory peer review and staff training, Amazon added.

src: https://www.ft.com/content/00c282de-ed14-4acd-a948-bc8d6bdb339d
👏2😱21🔥1
AWS Cost Optimization Game Day — a hands-on, interactive session focused on improving cloud efficiency and reducing costs in real-world scenarios.

You’ll collaborate, analyze architectures, uncover cost-saving opportunities, and compete in a fun, gamified environment.

Ready to optimize and win?
Let’s play smart with AWS!

When: Wednesday, Mar 11 · 4:30 PM to 7:30 PM GMT+2
Language: English

Registration link is here
🔥42👍1
Understanding how many pods your infrastructure can actually support is crucial for reliability. This overview breaks down the nuances of Kubernetes cluster capacity and resource allocation.
https://dnastacio.medium.com/kubernetes-cluster-capacity-d96d0d82b380
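As a back-of-the-envelope illustration of the same idea (all numbers below are invented; real allocatable values come from `kubectl describe node`): a node's schedulable pod count is the minimum over the CPU, memory, and max-pods constraints, after subtracting fixed per-node overhead such as DaemonSets:

```python
def schedulable_pods(alloc_cpu_m, alloc_mem_mi, max_pods,
                     ds_cpu_m, ds_mem_mi, ds_pods,
                     pod_cpu_m, pod_mem_mi):
    """Rough per-node pod capacity: min of CPU, memory, and pod-count limits.

    CPU is in millicores, memory in MiB; ds_* is DaemonSet overhead,
    pod_* is the per-pod resource request.
    """
    cpu_fit = (alloc_cpu_m - ds_cpu_m) // pod_cpu_m
    mem_fit = (alloc_mem_mi - ds_mem_mi) // pod_mem_mi
    return max(0, min(cpu_fit, mem_fit, max_pods - ds_pods))

# A 4-vCPU / 16 GiB node with the default 110-pod kubelet limit,
# two DaemonSet pods, and app pods requesting 250m CPU / 512 Mi:
n = schedulable_pods(alloc_cpu_m=3920, alloc_mem_mi=14500, max_pods=110,
                     ds_cpu_m=200, ds_mem_mi=300, ds_pods=2,
                     pod_cpu_m=250, pod_mem_mi=512)
print(n)  # prints 14 -- CPU is the binding constraint here
```

Note that the binding constraint is often not what you expect: here memory would fit 27 pods and the kubelet would allow 108, but CPU requests cap the node at 14.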
👍2