Upgrading a critical database cluster often involves anxiety, but this practical guide outlines a method to update PostgreSQL without losing data or incurring significant downtime. It covers the essential command-line steps and verification processes needed for a smooth transition.
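For a rough feel of what such an upgrade involves, a major-version jump with pg_upgrade usually looks something like the sketch below (paths, versions, and cluster layout here are generic examples, not taken from the article):

```shell
# Dry-run the upgrade first; --check validates compatibility without
# changing anything (paths below are Debian/Ubuntu-style examples)
pg_upgrade \
  --old-datadir=/var/lib/postgresql/13/main \
  --new-datadir=/var/lib/postgresql/16/main \
  --old-bindir=/usr/lib/postgresql/13/bin \
  --new-bindir=/usr/lib/postgresql/16/bin \
  --link --check

# If the check passes, stop the old cluster and rerun without --check.
# --link hard-links data files instead of copying them, which is what
# keeps downtime to minutes instead of hours on large clusters.
```

The article adds the high-availability side (replicas, switchover, verification) on top of this core step.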
https://palark.com/blog/postgresql-upgrade-no-data-loss-downtime/
Palark
Upgrading PostgreSQL with no data loss and minimal downtime | Tech blog | Palark
A technical story of upgrading a production PostgreSQL cluster from v13 to v16. It focuses on high availability and minimal downtime.
👍7
Julius Volz discusses the trade-offs between different observability standards in the monitoring landscape. His argument explains why he still prefers native Prometheus instrumentation over OpenTelemetry for certain use cases.
https://promlabs.com/blog/2025/07/17/why-i-recommend-native-prometheus-instrumentation-over-opentelemetry/
Promlabs
Blog - Why I recommend native Prometheus instrumentation over OpenTelemetry
PromLabs - We teach Prometheus-based monitoring and observability
👍4
Finally, FluxCD has a GUI. People say it looks like ArgoCD. I’ve never used Argo, but if that’s true, it’s a good move from the FluxCD team.
The main reason for Flux’s lower adoption, in my opinion, was the lack of out-of-the-box visibility. Many people want to see the status of resources directly, rather than rely on custom notifications from their own systems or parse logs when an update hasn’t been delivered as expected.
At my current workplace, I built a feedback system that shows the current state inside a GitLab pipeline, but this approach is not efficient. It doesn’t make sense that every company has to build its own solution and spend time on this just because the tool doesn’t provide a default feedback mechanism, such as a UI.
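For context, the flux CLI does expose some of this state already, but it is pull-based and per-terminal, which is exactly the gap a shared UI closes. A couple of standard commands (shown as illustration, not a replacement for the UI):

```shell
# Point-in-time status of all Kustomizations: ready state, applied revision
flux get kustomizations --all-namespaces

# Recent reconciliation events for one object (resource name is an example)
flux events --for Kustomization/apps
```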
https://fluxoperator.dev/web-ui/
Flux Operator
Web UI - Flux Operator
Mission control dashboards for Kubernetes app delivery powered by Flux CD.
👍8🔥3❤1
Not a big feature, but a small quality-of-life improvement that AWS provides. Automatic ECR repository creation is one of those features we’ve needed for a long time.
Literally a couple of weeks ago, we discussed with the team how we would automate this to simplify life for both us and the development team. Now it’s here, and we won’t have to spend time building “workarounds” around it.
https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/
Amazon
Amazon ECR now supports creating repositories on push - AWS
Discover more about what's new at AWS with Amazon ECR now supports creating repositories on push
👍5
An open-source platform for learning Kubernetes and AWS EKS and preparing for the Certified Kubernetes exams (CKA, CKS, CKAD).
https://github.com/ViktorUJ/cks
GitHub
GitHub - ViktorUJ/cks: Open-source Platform for learning kubernetes and aws eks and preparation for for Certified Kubernetes…
Open-source Platform for learning kubernetes and aws eks and preparation for for Certified Kubernetes exams (CKA ,CKS , CKAD) - GitHub - ViktorUJ/cks: Open-source Platform for learning kubern...
👍9❤5🔥4
One of the simplest solutions I’ve used for managing ephemeral environments is the one provided by Flux Operator. The configuration is straightforward and works out of the box, even for complex environments. This setup makes an environment available for each merge request and, at the same time, provides fast termination and cleanup of resources.
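As a rough sketch of the shape of this setup (kinds and templating delimiters follow the Flux Operator docs, but field names here are partly from memory, so treat this as illustrative rather than copy-paste):

```yaml
# Illustrative only: an input provider that watches GitLab MRs, and a
# ResourceSet that stamps out one preview namespace per open MR.
apiVersion: fluxcd.controlplane.io/v1
kind: ResourceSetInputProvider
metadata:
  name: gitlab-mrs
spec:
  type: GitLabMergeRequest
  url: https://gitlab.com/example/app   # hypothetical project URL
---
apiVersion: fluxcd.controlplane.io/v1
kind: ResourceSet
metadata:
  name: app-previews
spec:
  inputsFrom:
    - kind: ResourceSetInputProvider
      name: gitlab-mrs
  resources:
    - apiVersion: v1
      kind: Namespace
      metadata:
        name: app-mr-<< inputs.id >>   # one namespace per merge request
```

When the MR is merged or closed, the input disappears and the operator garbage-collects the generated resources, which is where the fast cleanup comes from.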
https://fluxoperator.dev/docs/resourcesets/gitlab-merge-requests/
Flux Operator
GitLab MRs Integration - Flux Operator Docs
Flux Operator preview environments integration with GitLab Merge Requests
👍3❤🔥2❤1💯1
This is not a positive "fragment" from Martin Fowler about AI and code quality; it sums up several other studies. After reading it, if this is true, I can say that tech professionals are protected and will not lose their jobs in the new year. The demand for skilled professionals might actually increase or shift toward higher-level maintenance and architecture: someone needs to clean up the "mess" AI might be creating, ensure long-term maintainability, and provide the deep understanding that AI currently lacks.
https://martinfowler.com/articles/20251204-frags.html
martinfowler.com
Fragments Dec 4
a short post
👍5❤2🔥1
The more operator tools you use, the more time you will spend replacing them after deprecation. Your processes might be well optimized, but a chain of deprecations can force you to spend time solving problems you have already solved, making changes that could leave your system less stable than before.
The year 2025 was a year of deprecation:
1. Kaniko was deprecated (link). Our team spent quite some time finding a solution with similar performance to avoid increasing pipeline build times.
2. NGINX Unit (link) was discontinued. Similarly, we had to find an application server that could handle high loads without slowing down.
3. Ingress-NGINX (link) was discontinued—the most impactful. The options were either to migrate to another solution or start using an API gateway.
Finding an “ideal” solution that fits your current needs doesn’t guarantee stability in the long term. One day, you might have to migrate to something new, introducing potential instability to your system.
👍10🔥1
Forwarded from DevOps & SRE notes (tutunak)
Looking for a hosting platform to practice with Linux, Kubernetes, etc.? Register using my referral link on DigitalOcean and get $200 in credit for 60 days. By registering through my referral link, you also support this Telegram channel.
👉 Register
👍4
You can do everything right but still be hacked through the official SDK. A couple of mistakes (a CI/CD misconfiguration and unanchored regular expressions) in AWS's own configuration of AWS CodeBuild, combined with predictable identifier generation in GitHub, resulted in admin access to the AWS GitHub account. The Wiz team reported this case of gaining access to the AWS GitHub. But how many companies have made similar mistakes, and have attackers already used them to inject vulnerabilities into widely used libraries?
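The "unanchored regular expressions" part is easy to reproduce in miniature. This toy check (the pattern and branch names are illustrative, not taken from the Wiz report) shows how a pattern meant to match a whole trusted name also matches attacker-controlled input when it isn't anchored:

```shell
branch="attacker-main-fork"

# Unanchored: grep finds "main" anywhere in the string, so the fake
# branch passes the "trusted" check.
if echo "$branch" | grep -Eq 'main'; then
  echo "unanchored: trusted (bad)"
fi

# Anchored (-x forces a whole-line match): the fake branch is rejected.
if echo "$branch" | grep -Eqx 'main'; then
  echo "anchored: trusted"
else
  echo "anchored: rejected (good)"
fi
```

The same principle applies to regexes in CI trust rules: anchor them (`^...$` or a full-match API) so "contains the trusted name" can never stand in for "is the trusted name".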
https://www.wiz.io/blog/wiz-research-codebreach-vulnerability-aws-codebuild
wiz.io
CodeBreach: Supply Chain Vuln & AWS CodeBuild Misconfig | Wiz Blog
Wiz Research discovered CodeBreach, a critical vulnerability that risked the AWS Console supply chain. Learn how to secure your AWS CodeBuild pipelines.
🔥4❤2
The author conducts a side-by-side security experiment using Minikube to compare a standard root-privileged container against a custom non-root Alpine container. Through three distinct attack vectors, the article illustrates how non-root configurations actively block common exploitation attempts that succeed in root-privileged environments.
Key Insights:
- Tooling Denial: In a root container, an attacker can easily install missing utilities (like curl) to fetch malicious payloads. The non-root container blocks package installation and unauthorized data fetching.
- Host Path Protection: The author demonstrates that if a sensitive host directory (like /etc/kubernetes/manifests) is mounted, a root user can write to it to deploy malicious static pods (e.g., crypto miners) or read sensitive host files (/etc/passwd). The non-root user is successfully denied permission to modify these files or inject new manifests.
- Privilege Escalation Barrier: The experiment shows that standard attempts to switch users (e.g., using su) inside a non-root container fail immediately, limiting an attacker's ability to escalate privileges or move laterally without explicit sudo misconfigurations.
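The non-root setup from the article boils down to a few pod-level fields; a minimal sketch (the image, UID, and pod name are examples, while the article builds its own Alpine image):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-demo
spec:
  containers:
    - name: app
      image: alpine:3.20              # example image; the article uses a custom one
      command: ["sleep", "infinity"]
      securityContext:
        runAsNonRoot: true            # kubelet refuses to start the container as UID 0
        runAsUser: 1000               # arbitrary unprivileged UID
        allowPrivilegeEscalation: false  # blocks su/sudo-style escalation paths
        readOnlyRootFilesystem: true     # stops package installs and payload drops
```

The last two fields are what defeat the "tooling denial" and "privilege escalation" attack vectors from the experiment, even if an attacker gets a shell.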
https://medium.com/@marcin.wasiucionek/why-is-running-as-root-in-kubernetes-containers-dangerous-e5f1a116080e
Medium
Why is running as root in Kubernetes containers dangerous?
Explore one of crucial Kubernetes security practices on real examples
👍5❤1
In November, a new major version of Helm was released, but for me and my colleagues it didn’t cause any excitement.
I checked the changelog and realized that there were no new features that would make my life easier or improve stability. I talked to people who use Argo, and the response was the same: it’s just another release. It even feels like it could have been a minor update rather than a major one.
https://t.iss.one/devops_sre_notes/2512
Why did this happen?
I think the current version of Helm is already good enough, especially if you are using GitOps with Argo CD or FluxCD. If you didn’t like Helm 3, you probably won’t change your opinion with the Helm 4.0 release.
👍5👌2💯2❤1
If you, like me, use linters in the pipeline for GitOps repositories, this repo is the best thing you can use. It contains popular Kubernetes CRDs (CustomResourceDefinition) in JSON schema format.
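For example, with kubeconform the catalog plugs in as an extra schema location (this is the invocation pattern the CRDs-catalog README documents; the `manifests/` path is an example):

```shell
kubeconform \
  -schema-location default \
  -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
  manifests/
```

Without the second `-schema-location`, kubeconform would only validate core Kubernetes kinds and skip or fail on CRD-based objects like HelmReleases or ServiceMonitors.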
https://github.com/datreeio/CRDs-catalog
GitHub
GitHub - datreeio/CRDs-catalog: Popular Kubernetes CRDs (CustomResourceDefinition) in JSON schema format.
Popular Kubernetes CRDs (CustomResourceDefinition) in JSON schema format. - datreeio/CRDs-catalog
👍3🔥3❤1
The article clarifies the distinction between Platform Engineering (focused on velocity and Developer Experience/DevEx) and Site Reliability Engineering (focused on stability and production health). It argues that while their daily tasks differ, they must be integrated: Platform Engineers build the "golden paths" that abstract infrastructure complexity, while SREs ensure those paths are robust, scalable, and monitored.
https://octopus.com/devops/platform-engineering/platform-engineering-vs-sre/
Octopus
Platform Engineering versus SRE: 5 differences and working together
Platform Engineering and Site Reliability Engineering (SRE) are distinct but complementary approaches to improving software development and delivery, with SRE focusing on reliability and Platform Engineering focusing on developer experience and efficiency.
👍2
Today I read the article “What Would a Kubernetes 2.0 Look Like?”
Thoughts on what the next major version might be. And found this :)
YAML is just too much for what we're trying to do with k8s and it's not a safe enough format. Indentation is error-prone, the files don't scale great (you really don't want a super long YAML file), debugging can be annoying. YAML has so many subtle behaviors outlined in its spec.
HCL is already the format for Terraform, so at least we'd only have to hate one configuration language instead of two. It's strongly typed with explicit types. There's already good validation mechanisms. It is specifically designed to do the job that we are asking YAML to do and it's not much harder to read.
and realized that Kubernetes developers had the same thoughts about using YAML- but instead of HCL, they just invented their own HCL-like language: KYAML.
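For a taste of what that looks like: KYAML stays valid YAML but leans on flow-style braces and always-quoted strings, so indentation stops carrying meaning (snippet is illustrative, based on the KYAML proposal):

```yaml
apiVersion: v1
kind: ConfigMap
metadata: {
  name: "example",
  namespace: "default",
}
data: {
  key: "value",
}
```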
🤣14
Recently I searched for a simple way to notify developers about changes in ConfigMaps. To my surprise, there is only one simple and straightforward tool that does exactly this one thing, and that is Kubewatch. So, if you would like a simple solution for notifying about changes to objects in your K8s cluster, choose Kubewatch.
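A minimal config sketch for this use case (a `.kubewatch.yaml`; the Slack token and channel are placeholders, and the field layout follows the project's README as I remember it, so double-check it against the repo):

```yaml
handler:
  slack:
    token: "xoxb-placeholder"
    channel: "#k8s-changes"
resource:
  configmap: true    # the one thing we wanted: ConfigMap change notifications
  deployment: false
  pod: false
```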
https://github.com/robusta-dev/kubewatch
GitHub
GitHub - robusta-dev/kubewatch: Watch k8s events and trigger Handlers
Watch k8s events and trigger Handlers. Contribute to robusta-dev/kubewatch development by creating an account on GitHub.
👍3❤2
Anyone who has been on call at night knows that it's impossible to react within minutes and triage an incident fast enough, especially if you end up in such situations very rarely. When you are paged once a quarter or once a year, all your dashboards are outdated, your diagnostic skills are rusty, and your understanding of the system has already changed a great deal. In those cases, a current AI agent can be useful. Looking at this article, I see that by the time you get paged, wake up, turn on your laptop, and try to open your eyes, the agent can already have triaged the incident and produced a full report with recommendations. Yes, we still need a human to approve those changes or make them manually (as with planes: people prefer to see a live human pilot, but autopilots are already better than humans).
https://www.opsworker.ai/blog/agent-driven-sre-investigations-a-practical-deep-dive-into-multi-agent-incident-response/
Blog | OpsWorker.ai — Insights on AIOps & AI SRE Automation
Agent-Driven SRE Investigations: A Practical Deep Dive into Multi-Agent Incident Response
A team of AI SRE agents is investigating Kubernetes incidents using k8s & GitHub MCPs to find root causes and create remediation pull requests.
👍5🔥2🎉1
The article "What happens inside the Kubernetes API server?" has been updated. It is a good starting point for preparing for your next K8s job interview.
https://learnkube.com/kubernetes-api-explained
LearnKube
What happens inside the Kubernetes API server?
Learn how requests flow through the Kubernetes API server — from authentication to etcd storage.
👍5