DevOps&SRE Library

How Meta keeps its AI hardware reliable

https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable

Achieving High Availability with distributed database on Kubernetes at Airbnb

We chose an innovative strategy of deploying a distributed database cluster across multiple Kubernetes clusters in a cloud environment. Although currently an uncommon design pattern due to its complexity, this strategy allowed us to achieve target system reliability and operability.

In this post, we’ll share how we overcame the challenges and the best practices we’ve developed for this strategy. We believe these practices should apply to any other strongly consistent, distributed storage system.

https://medium.com/airbnb-engineering/achieving-high-availability-with-distributed-database-on-kubernetes-at-airbnb-58cc2e9856f4

Introducing Off-CPU Profiling

How Off-CPU profiling works and how to get the most out of it

https://www.polarsignals.com/blog/posts/2025/07/30/introducing-off-cpu-profiling

FossFLOW

FossFLOW is a powerful, open-source Progressive Web App (PWA) for creating beautiful isometric diagrams. Built with React and the Isoflow library (now forked and published to npm as fossflow), it runs entirely in your browser with offline support.

https://github.com/stan-smith/FossFLOW

rotel

Rotel provides an efficient, high-performance solution for collecting, processing, and exporting telemetry data. Rotel is ideal for resource-constrained environments and applications where minimizing overhead is critical.

https://github.com/streamfold/rotel

Can LLMs replace on call SREs today?

There's a growing belief that AI-powered observability will soon reduce or even replace the role of Site Reliability Engineers (SREs). That's a bold claim, and at ClickHouse, we were curious to see how close we actually are.

https://clickhouse.com/blog/llm-observability-challenge

Cloudflare incident on August 21, 2025

On August 21, 2025, an influx of traffic directed toward clients hosted in the Amazon Web Services (AWS) us-east-1 facility caused severe congestion on links between Cloudflare and AWS us-east-1. Many users who were connecting to or receiving connections from Cloudflare via servers in AWS us-east-1 experienced high latency, packet loss, and failures reaching origins.

Customers with origins in AWS us-east-1 began experiencing impact at 16:27 UTC. The impact was substantially reduced by 19:38 UTC, with intermittent latency increases continuing until 20:18 UTC.

This was a regional problem between Cloudflare and AWS us-east-1, and global Cloudflare services were not affected. The degradation in performance was limited to traffic between Cloudflare and AWS us-east-1. The incident was a result of a surge of traffic from a single customer that overloaded Cloudflare's links with AWS us-east-1. It was a network congestion event, not an attack or a BGP hijack.

We’re very sorry for this incident. In this post, we explain what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again.

https://blog.cloudflare.com/cloudflare-incident-on-august-21-2025

Pooling Connections with RDS Proxy at Klaviyo

How we scale our databases with RDS Proxy

https://klaviyo.tech/pooling-connections-with-rds-proxy-at-klaviyo-e79e04120188

Availability Models

Because “Highly Available” Isn’t Saying Much

https://www.thecoder.cafe/p/availability-models

SLI Evolution Stages

https://blog.alexewerlof.com/p/sli-evolution-stages

When “Anti-Patterns” Become Best Practice: Lessons from Migrating a Global Pub/Sub Empire to Kubernetes

How architecting for scale taught us that sometimes breaking the rules is exactly what the business needs

https://bitbucket90.com/when-anti-patterns-become-best-practice-lessons-from-migrating-a-global-pub-sub-empire-to-k8s-c3dbcebdca68

Digging Deeper: How Pause containers skew your Kubernetes CPU/Memory Metrics

https://medium.com/@amolsingh.singh23/digging-deeper-how-pause-containers-skew-your-kubernetes-cpu-memory-metrics-c50f3832cbe0

Kubernetes Services: A Deep Dive with Examples

https://sheakimran.hashnode.dev/kubernetes-services-a-deep-dive-with-examples

How We Cut Our Azure Cloud Costs by 3x

https://igoryerm.medium.com/how-we-cut-our-azure-cloud-costs-by-3-solda-ais-experience-212de2fc0375

How to easily migrate Ingress to Gateway API in Kubernetes

https://medium.com/@kkrzywicki/how-to-easily-migrate-ingress-to-gateway-api-1d479639c43e

Key Learnings from Creating Multi-Tenant GKE Clusters on Google Cloud with Thousands of Publicly Addressable Services

https://medium.com/google-cloud/key-learnings-from-creating-multi-tenant-gke-clusters-on-google-cloud-with-thousands-of-publicly-ea27d7bcd651

End-to-End DAG Testing in Airflow, Minus the Kubernetes Headache

https://medium.com/@eladpardilov/end-to-end-dag-testing-in-airflow-minus-the-kubernetes-headache-93b20bfa48a0

kubectl-explore

A better kubectl explain with the fuzzy finder

https://github.com/keisku/kubectl-explore

helm-cel

A Helm plugin that uses Common Expression Language (CEL) to validate values. Instead of using JSON Schema in values.schema.json, you can write more expressive validation rules using CEL in values.cel.yaml.

https://github.com/idsulik/helm-cel

kubechecks

kubechecks allows users of GitHub and GitLab to see exactly what their changes will affect on their current Argo CD deployments, as well as automatically run various conformance test suites prior to merge.

https://github.com/zapier/kubechecks

Using Terraform with GitLab

https://scalr.com/learning-center/using-terraform-with-gitlab
