DevOps&SRE Library

How we discovered, and recovered from, Postgres corruption on the matrix.org homeserver

https://matrix.org/blog/2025/07/postgres-corruption-postmortem


marchat

Terminal-based group chat app with real-time WebSocket messaging, file sharing, themes, and admin tools — built with Go and Bubble Tea.

https://github.com/Cod-e-Codes/marchat


tududi

Self-hosted task management that combines the simplicity of personal with the power of professional project organization. Built for individuals and teams who value privacy, control, and efficiency.

https://github.com/chrisvel/tududi


Warmup Your Pods Using Istio

https://medium.com/blablacar/warmup-your-pods-using-istio-5249ec68f0e9


When VerticalPodAutoscaler Goes Rogue: How an Autoscaler Took Down Our Cluster

https://medium.com/learnings-from-the-paas/when-verticalpodautoscaler-goes-rogue-how-an-autoscaler-took-down-our-cluster-8c7479d5be3c


On How We Moved to Kubernetes

Have you heard of Kubernetes (also known as k8s)? Until a few months back, I knew it existed and that it was like infrastructure’s holy grail. It has to cover the basics, like auto-scaling and load balancing or automated rollbacks… And then there are millions of tools to build on top of it.

As we recently migrated our deployment from AWS Elastic Container Service (ECS) to AWS Elastic Kubernetes Service (EKS, a managed Kubernetes cluster), I wanted to share some tips. It also feels nice to do that on the 10th anniversary of MeteorHack’s blog post “Kubernetes: The Future of Cloud Hosting”.

Please keep in mind that a Kubernetes cluster is an extremely complex beast, and I’m pretty far from being able to explain all the “whys” you may have. Our amazing DevOps Engineer managed to make it work, and I’m really happy with the current setup. Both because the app performs better at a lower cost and because I learned a lot along the way.

https://radekmie.dev/blog/on-how-we-moved-to-kubernetes


Split Cost Allocation Data for Amazon EKS

https://medium.com/@hirsch.elad/split-cost-allocation-data-for-amazon-eks-deb59dbd344a


EKS vs. GKE Networking

https://jason-umiker.medium.com/eks-vs-gke-networking-e1dd397fe86d


Container Network Interface (CNI) in Kubernetes: An Introduction

https://itnext.io/container-network-interface-cni-in-kubernetes-an-introduction-6cd453b622bd


Improve performance of memory intensive applications on EKS cluster using Huge Pages!

https://medium.com/@jainshubham0403/improve-performance-of-memory-intensive-applications-on-eks-cluster-using-huge-pages-44ba7a25f4b1


kubermatic

Kubermatic Kubernetes Platform is an open source project to centrally manage the global automation of thousands of Kubernetes clusters across multicloud, on-prem and edge with unparalleled density and resilience.

https://github.com/kubermatic/kubermatic


Calico eBPF Source IP Preservation: The Unexpected Story of High Tail Latency

https://medium.com/@TigeraCalico/calico-ebpf-source-ip-preservation-the-unexpected-story-of-high-tail-latency-b6046ac7de1a


Understanding the Impact of externalTrafficPolicy on Kubernetes Services

https://medium.com/@zghanem/understanding-the-impact-of-externaltrafficpolicy-on-kubernetes-services-4f4426cb1246


Continuous Promotion on Kubernetes with GitOps

This article will teach you how to continuously promote application releases between environments on Kubernetes using the GitOps approach.

https://piotrminkowski.com/2025/01/14/continuous-promotion-on-kubernetes-with-gitops


Managing Over 6,000 Self-Hosted Databases Without a DBA — How a Single Engineer Leveraged KubeBlocks to Make It Possible

https://medium.com/@apecloud.info/managing-over-6-000-self-hosted-databases-without-a-dba-how-a-single-engineer-leveraged-95143fdd5c8f


Gracefully Terminating Pods in Kubernetes: Handling SIGTERM

https://jaadds.medium.com/gracefully-terminating-pods-in-kubernetes-handling-sigterm-fb0d60c7e983


Chaos testing a Postgres cluster managed by CloudNativePG

https://coroot.com/blog/engineering/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg


Exploring Istio: The Power of Service Mesh in Kubernetes

https://medium.com/@blogs4devs/exploring-istio-the-power-of-service-mesh-in-kubernetes-f8d6c8465c04


A Quick(ish) Introduction to Tuning Postgres

Most guides to the finer aspects of managing databases like Postgres are… not great. The Postgres documentation is well-written, but it has too much information for most developers. On the other hand, most online Postgres optimization guides are essentially a repeated version of: “Run this command. Got it? Cool.” This should provide you with a relatively brief introduction to Postgres tuning, focusing on the most important knobs, while also describing how these knobs relate to Postgres’s overall functioning and internals.

https://byteofdev.com/posts/tuning-postgres-intro


Avoiding the ironies of automation

We're using AI to build an agentic product that works collaboratively with responders to improve incident investigations and resolve incidents faster. A bold claim, I know, and I think pretty impressive to land the word “agentic” so early on—I promise it’s the last time I use it.

After six months of digging into this, I’m convinced: AI in incident response won’t just be helpful—it’ll be essential. As more software is built with, and increasingly by, AI, responders will have less and less context about the systems they’re operating. That shrinking understanding—combined with the ever-growing volume of software—only increases the need for tools that can assist.

Done right, there's a huge upside in this approach too—faster incident resolution, reduced customer impact, and less cognitive burden on the folks putting out the fires.

But with more automation comes a new shape of risk, much of which is captured in Lisanne Bainbridge’s 1983 paper, Ironies of Automation. In the paper, Bainbridge explains that automation meant to help can paradoxically make things harder. As routine tasks get automated, human skills fade from lack of practice, so when the system fails (and it will!), responders are left underprepared and out of context.

Working in tech companies, I’m yet to see these risks materialise seriously, but there are definite elements of truth here. Count the number of Kubernetes incidents where operators have no idea what’s happening and you’ll get the gist.

https://incident.io/building-with-ai/avoiding-the-ironies-of-automation


Practical Problems with Auto-Increment

In this post I'm going to demonstrate 2 reasons I will be avoiding auto-increment fields in Postgres and MySQL in future. I'm going to prefer using UUID fields unless I have a very good reason not to.

https://samwho.dev/blog/practical-problems-with-auto-increment
