DevOps&SRE Library

Managing Over 6,000 Self-Hosted Databases Without a DBA — How a Single Engineer Leveraged KubeBlocks to Make It Possible

https://medium.com/@apecloud.info/managing-over-6-000-self-hosted-databases-without-a-dba-how-a-single-engineer-leveraged-95143fdd5c8f


Gracefully Terminating Pods in Kubernetes: Handling SIGTERM

https://jaadds.medium.com/gracefully-terminating-pods-in-kubernetes-handling-sigterm-fb0d60c7e983


Chaos testing a Postgres cluster managed by CloudNativePG

https://coroot.com/blog/engineering/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg


Exploring Istio: The Power of Service Mesh in Kubernetes

https://medium.com/@blogs4devs/exploring-istio-the-power-of-service-mesh-in-kubernetes-f8d6c8465c04


A Quick(ish) Introduction to Tuning Postgres

Most guides to the finer aspects of managing databases like Postgres are… not great. The Postgres documentation is well-written, but it has too much information for most developers. On the other hand, most online Postgres optimization guides are essentially a repeated version of: “Run this command. Got it? Cool.” This post should give you a relatively brief introduction to Postgres tuning, focusing on the most important knobs, while also describing how these knobs relate to Postgres’s overall functioning and internals.

https://byteofdev.com/posts/tuning-postgres-intro


Avoiding the ironies of automation

We're using AI to build an agentic product that works collaboratively with responders to improve incident investigations and resolve incidents faster. A bold claim, I know, and I think pretty impressive to land the word “agentic” so early on—I promise it’s the last time I use it.

After six months of digging into this, I’m convinced: AI in incident response won’t just be helpful—it’ll be essential. As more software is built with, and increasingly by, AI, responders will have less and less context about the systems they’re operating. That shrinking understanding—combined with the ever-growing volume of software—only increases the need for tools that can assist.

Done right, there's a huge upside in this approach too—faster incident resolution, reduced customer impact, and less cognitive burden on the folks putting out the fires.

But with more automation comes a new shape of risk, much of which is captured in Lisanne Bainbridge’s 1983 paper, Ironies of automation. In the paper, Bainbridge explains that automation meant to help can paradoxically make things harder. As routine tasks get automated, human skills fade from lack of practice, so when systems fail (and they will!), responders are left underprepared and out of context.

Working in tech companies, I’m yet to see these risks materialise seriously, but there are definite elements of truth here. Count the number of Kubernetes incidents where operators have no idea what’s happening and you’ll get the gist.

https://incident.io/building-with-ai/avoiding-the-ironies-of-automation


Practical Problems with Auto-Increment

In this post I'm going to demonstrate 2 reasons I will be avoiding auto-increment fields in Postgres and MySQL in future. I'm going to prefer using UUID fields unless I have a very good reason not to.

https://samwho.dev/blog/practical-problems-with-auto-increment


Choosing Between Count and For-Each

Terraform has two looping mechanisms for creating multiple resources, count and for_each. The count meta-argument has been around for a long time, but for_each is a relative newcomer (introduced in version 0.12). Each meta-argument allows you to create more than one resource or module with a single configuration block.
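One practical difference between the two meta-arguments is how instances are addressed: count identifies each instance by its position in a list, while for_each identifies it by a key. The following is a minimal Python model of that addressing, not Terraform code; the resource address `example.this`, the helper names, and the sample names are all hypothetical and exist only for illustration.

```python
# Hypothetical model of instance addressing; this is not Terraform itself.

def addresses_with_count(names):
    """count-style: each instance is addressed by its list index."""
    return {f"example.this[{i}]": name for i, name in enumerate(names)}


def addresses_with_for_each(names):
    """for_each-style: each instance is addressed by a string key."""
    return {f'example.this["{name}"]': name for name in names}


def changed(before, after):
    """Addresses that were added, removed, or now point at a different value."""
    return sorted(
        address
        for address in before.keys() | after.keys()
        if before.get(address) != after.get(address)
    )


before = ["alice", "bob", "carol"]
after = ["alice", "carol"]  # "bob" removed from the middle of the list

count_changes = changed(addresses_with_count(before), addresses_with_count(after))
for_each_changes = changed(
    addresses_with_for_each(before), addresses_with_for_each(after)
)

print("count:   ", count_changes)
print("for_each:", for_each_changes)
```

In this model, removing an item from the middle of the list touches two index-based addresses (everything after the removed item shifts), but only one key-based address.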

https://nedinthecloud.com/2022/01/27/choosing-between-count-and-for-each


The Art of Not Getting Woken Up for Nothing

Strategies from SRE leaders fighting noisy alerts in complex systems.

https://rootly.com/blog/the-art-of-not-getting-woken-up-for-nothing


s3grep

s3grep is a parallel CLI tool for searching logs and unstructured content in Amazon S3 buckets. It supports .gz decompression, progress bars, and robust error handling—making it ideal for cloud-native log analysis.
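The general idea described above (search many objects in parallel, decompressing .gz content on the way) can be sketched in a few lines of Python. This is not s3grep's implementation and does not talk to S3: the in-memory `objects` dictionary stands in for a bucket listing, and the function names and sample keys are assumptions made for the example.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def grep_object(key, data, pattern):
    """Return (key, line_number, line) for every line that contains pattern."""
    if key.endswith(".gz"):
        data = gzip.decompress(data)
    text = data.decode("utf-8", errors="replace")
    return [
        (key, number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern in line
    ]


def grep_objects(objects, pattern, workers=4):
    """Search all objects concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_object = pool.map(
            lambda item: grep_object(item[0], item[1], pattern), objects.items()
        )
        return [hit for hits in per_object for hit in hits]


# Stand-in for bucket contents: object key -> raw bytes.
objects = {
    "logs/app-1.log.gz": gzip.compress(b"ok\nERROR disk full\n"),
    "logs/app-2.log": b"ERROR timeout\nok\n",
}

hits = grep_objects(objects, "ERROR")
for hit in hits:
    print(hit)
```

A thread pool suits this kind of work when the time is dominated by waiting on downloads; the real tool's concurrency model may differ.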

https://github.com/dacort/s3grep


Debugging the One-in-a-Million Failure: Migrating Pinterest’s Search Infrastructure to Kubernetes

While migrating Pinterest’s search infrastructure — which powers core experiences for millions of users monthly — to Kubernetes, we faced a challenge in the new environment: one in every million search requests took 100x longer than usual.

This post chronicles our investigation, uncovering an elusive interaction between our memory-intensive search system and a seemingly innocent monitoring process. The journey involves profiling search systems, debugging performance issues, Linux kernel features, and memory management.

https://medium.com/pinterest-engineering/debugging-the-one-in-a-million-failure-migrating-pinterests-search-infrastructure-to-kubernetes-bef9af9dabf4


Can simple 4 Core, 16 GB RAM reach 1000 tps?

https://dev.to/djinn/can-simple-4-core-16-gb-ram-reach-1000-tps-5pl


sentinel

Multi-protocol service monitoring system with real-time alerts and web dashboard. Supports HTTP/HTTPS, TCP and gRPC monitoring with Telegram notifications.

https://github.com/sxwebdev/sentinel


How We Saved $1.22 Million Annually on GCP Costs in a Few Simple Steps

https://medium.com/@ofekatr1el/how-we-saved-1-22-million-annually-on-gcp-costs-in-a-few-simple-steps-3f99ba3ba0ae


Inside Kubernetes Scheduler: What Really Happens Before Your Pod Lands on a Node

https://medium.com/@hmusicofficial27/inside-kubernetes-scheduler-what-really-happens-before-your-pod-lands-on-a-node-99e9aeb829a1


Overcoming the downsides of mutating webhooks: Our journey to an alternative

UiPath Automation Suite has many services that communicate using FQDN (Fully Qualified Domain Name). As this suite operates on the premises of our customers, it provides them with the freedom to select their own FQDN. Often, the certificate required for their chosen FQDN is not signed by a known authority. To talk securely using the HTTPS protocol, all the services must trust the FQDN’s certificate. However, these services are owned by multiple teams. Asking each team to handle this individually is cumbersome and makes managing future certificate trust requests more challenging.

https://engineering.uipath.com/overcoming-the-downsides-of-mutating-webhooks-our-journey-to-an-alternative-5b0fbea83c59


Scaling Batch Jobs for Reliable and Efficient Processing

https://engineering.traderepublic.com/scaling-batch-jobs-for-reliable-and-efficient-processing-da6242cdb9f9


Optimizing Distributed Tracing with Jaeger DaemonSet: A Comprehensive Guide to Log Collection

https://medium.datadriveninvestor.com/optimizing-distributed-tracing-with-jaeger-daemonset-a-comprehensive-guide-to-log-collection-1963cebee37


Submariner Lighthouse: Multi-Cluster Service Discovery for Kubernetes

https://dev.to/reoring/submariner-lighthouse-multi-cluster-service-discovery-for-kubernetes-4fj7


HAMi

HAMi, formerly known as 'k8s-vGPU-scheduler', is a heterogeneous device management middleware for Kubernetes. It can manage different types of heterogeneous devices (such as GPUs and NPUs), share them among pods, and make better scheduling decisions based on device topology and scheduling policies.

https://github.com/Project-HAMi/HAMi


From Linux Primitives to Kubernetes Security Contexts

In Kubernetes, containers typically start with root privileges.

This happens because, by default, container processes run as UID 0 unless overridden.

Kubernetes does not impose a non-root policy; it inherits whatever the image defines.

This isn't a bug, it's a design choice carried over from Docker.

While convenient during development, it introduces unnecessary risk in production environments.

If an attacker compromises the container, root access increases the likelihood of privilege escalation to the host.

The Kubernetes API offers several ways to restrict container privileges using the Security Context.

With it, you can control the user a container runs as, manage Linux capabilities, enforce read-only filesystems, and block privilege escalation.

However, despite their importance, Security Contexts are often misunderstood or misapplied.

Many teams discover these controls only after a security audit or scanner flags a running container.

The usual next steps are to patch the config reactively, suppress the warning, and move on.

Before we get into Kubernetes SecurityContexts, we need to understand what they're actually configuring under the hood.
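As a rough illustration of the controls listed above (the user a container runs as, Linux capabilities, a read-only filesystem, and privilege escalation), here is a minimal Python sketch that builds a Pod manifest with a restrictive container-level securityContext and prints it as JSON. The field names are standard Kubernetes API fields; the pod name, image, and UID 10001 are placeholder assumptions, and the right values depend on the workload.

```python
import json

# Container-level securityContext. The values are illustrative assumptions.
RESTRICTIVE_SECURITY_CONTEXT = {
    "runAsNonRoot": True,               # refuse to start if the process would run as UID 0
    "runAsUser": 10001,                 # hypothetical non-root UID
    "allowPrivilegeEscalation": False,  # block gaining more privileges than the parent
    "readOnlyRootFilesystem": True,     # mount the root filesystem read-only
    "capabilities": {"drop": ["ALL"]},  # drop every Linux capability
}


def build_pod(name, image):
    """Return a Pod manifest (as a dict) whose single container is locked down."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "securityContext": dict(RESTRICTIVE_SECURITY_CONTEXT),
                }
            ]
        },
    }


# Placeholder name and image, not taken from the article.
pod = build_pod("demo", "registry.example.com/demo:1.0")
print(json.dumps(pod, indent=2))
```

Kubernetes accepts manifests in JSON as well as YAML, so the printed output could be saved to a file and applied as-is, assuming the image can actually run as a non-root user with a read-only root filesystem.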

https://learnkube.com/security-contexts

