Can LLMs replace on-call SREs today?
https://clickhouse.com/blog/llm-observability-challenge
There's a growing belief that AI-powered observability will soon reduce or even replace the role of Site Reliability Engineers (SREs). That's a bold claim, and at ClickHouse, we were curious to see how close we actually are.
Cloudflare incident on August 21, 2025
https://blog.cloudflare.com/cloudflare-incident-on-august-21-2025
On August 21, 2025, an influx of traffic directed toward clients hosted in the Amazon Web Services (AWS) us-east-1 facility caused severe congestion on links between Cloudflare and AWS us-east-1. This impacted many users who were connecting to or receiving connections from Cloudflare via servers in AWS us-east-1 in the form of high latency, packet loss, and failures to origins.
Customers with origins in AWS us-east-1 began experiencing impact at 16:27 UTC. The impact was substantially reduced by 19:38 UTC, with intermittent latency increases continuing until 20:18 UTC.
This was a regional problem between Cloudflare and AWS us-east-1, and global Cloudflare services were not affected. The degradation in performance was limited to traffic between Cloudflare and AWS us-east-1. The incident was a result of a surge of traffic from a single customer that overloaded Cloudflare's links with AWS us-east-1. It was a network congestion event, not an attack or a BGP hijack.
We’re very sorry for this incident. In this post, we explain what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again.
Pooling Connections with RDS Proxy at Klaviyo
https://klaviyo.tech/pooling-connections-with-rds-proxy-at-klaviyo-e79e04120188
How we scale our databases with RDS Proxy
Availability Models
https://www.thecoder.cafe/p/availability-models
Because “Highly Available” Isn’t Saying Much
When “Anti-Patterns” Become Best Practice: Lessons from Migrating a Global Pub/Sub Empire to Kubernetes
https://bitbucket90.com/when-anti-patterns-become-best-practice-lessons-from-migrating-a-global-pub-sub-empire-to-k8s-c3dbcebdca68
How architecting for scale taught us that sometimes breaking the rules is exactly what the business needs
Digging Deeper: How Pause containers skew your Kubernetes CPU/Memory Metrics
https://medium.com/@amolsingh.singh23/digging-deeper-how-pause-containers-skew-your-kubernetes-cpu-memory-metrics-c50f3832cbe0
Kubernetes Services: A Deep Dive with Examples
https://sheakimran.hashnode.dev/kubernetes-services-a-deep-dive-with-examples
How We Cut Our Azure Cloud Costs by 3x
https://igoryerm.medium.com/how-we-cut-our-azure-cloud-costs-by-3-solda-ais-experience-212de2fc0375
How to easily migrate ingress to gateway API in Kubernetes
https://medium.com/@kkrzywicki/how-to-easily-migrate-ingress-to-gateway-api-1d479639c43e
Key Learnings from Creating Multi-Tenant GKE Clusters on Google Cloud with Thousands of Publicly Addressable Services
https://medium.com/google-cloud/key-learnings-from-creating-multi-tenant-gke-clusters-on-google-cloud-with-thousands-of-publicly-ea27d7bcd651
End-to-End DAG Testing in Airflow, Minus the Kubernetes Headache
https://medium.com/@eladpardilov/end-to-end-dag-testing-in-airflow-minus-the-kubernetes-headache-93b20bfa48a0
kubectl-explore
https://github.com/keisku/kubectl-explore
A better kubectl explain with the fuzzy finder
helm-cel
https://github.com/idsulik/helm-cel
A Helm plugin that uses Common Expression Language (CEL) to validate values. Instead of using JSON Schema in values.schema.json, you can write more expressive validation rules using CEL in values.cel.yaml.
kubechecks
https://github.com/zapier/kubechecks
kubechecks allows users of GitHub and GitLab to see exactly what their changes will affect on their current ArgoCD deployments, as well as automatically run various conformance test suites prior to merge.
Implementing Karpenter In EKS (From Start To Finish)
https://www.cloudnativedeepdive.com/implementing-karpenter-in-eks-from-start-to-finish
Auto-scaling GitHub Actions on Kubernetes with Actions Runner Controller (ARC) & Terraform
https://blog.devgenius.io/auto-scaling-github-actions-on-kubernetes-with-actions-runner-controller-arc-terraform-ca9d651c08d8
gonzo
https://github.com/control-theory/gonzo
A powerful, real-time log analysis terminal UI inspired by k9s. Analyze log streams with beautiful charts, AI-powered insights, and advanced filtering - all from your terminal.
sbnb
https://github.com/sbnb-io/sbnb
Sbnb Linux is a revolutionary minimalist Linux distribution designed to boot bare-metal servers and enable remote connections through fast tunnels. It is ideal for environments ranging from home labs to distributed data centers. Sbnb Linux is simplified, automated, and resilient to power outages, supporting confidential computing to ensure secure operations in untrusted locations.