Diving deep into distributed microservices with OpenSearch and OpenTelemetry
https://opensearch.org/blog/diving-deep-into-distributed-microservices-with-opensearch-and-opentelemetry
Tracing GenAI Applications Is Not Enough
https://krisztianfekete.org/tracing-genai-applications-is-not-enough
Top 10 Status Page Examples: What We Like and What’s Missing
https://www.checklyhq.com/blog/top-10-status-page-examples
Redesigning Workers KV for increased availability and faster performance
https://blog.cloudflare.com/rearchitecting-workers-kv-for-redundancy
6 Reasons You Don't Need an SRE Team
https://log.andvari.net/6reasons.html
The model of large SRE teams covering many services in a vague and nebulous way that's open to repeated re-interpretation is mostly a side-effect of (a) cargo-culting the building of these large groups, or (b) retrofitting SRE/DevOps onto existing groups without the company-wide reliability focus required (or the fortitude to decide you didn't need such a large group to do SRE).
Flame Charts: The Time-Aware Sibling of Flame Graphs
https://www.polarsignals.com/blog/posts/2025/05/28/flamecharts-the-time-aware-sibling-of-flame-graphs
Why I recommend native Prometheus instrumentation over OpenTelemetry
https://promlabs.com/blog/2025/07/17/why-i-recommend-native-prometheus-instrumentation-over-opentelemetry
Choosing the right OpenTelemetry Collector distribution
https://www.datadoghq.com/blog/otel-collector-distributions
Setting Up OpenTelemetry on the Frontend Because I Hate Myself
https://thenewstack.io/setting-up-opentelemetry-on-the-frontend-because-i-hate-myself
Frontend developers deserve so much better from OpenTelemetry, especially since they stand to benefit so much.
How Meta keeps its AI hardware reliable
https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable
Achieving High Availability with distributed database on Kubernetes at Airbnb
https://medium.com/airbnb-engineering/achieving-high-availability-with-distributed-database-on-kubernetes-at-airbnb-58cc2e9856f4
We chose an innovative strategy of deploying a distributed database cluster across multiple Kubernetes clusters in a cloud environment. Although currently an uncommon design pattern due to its complexity, this strategy allowed us to achieve target system reliability and operability.
In this post, we share how we overcame the challenges and the best practices we developed for this strategy. We believe these practices should apply to other strongly consistent, distributed storage systems.
Introducing Off-CPU Profiling
https://www.polarsignals.com/blog/posts/2025/07/30/introducing-off-cpu-profiling
How Off-CPU profiling works and how to get the most out of it
FossFLOW
https://github.com/stan-smith/FossFLOW
FossFLOW is a powerful, open-source Progressive Web App (PWA) for creating beautiful isometric diagrams. Built with React and the Isoflow library (now forked and published to npm as fossflow), it runs entirely in your browser with offline support.
rotel
https://github.com/streamfold/rotel
Rotel provides an efficient, high-performance solution for collecting, processing, and exporting telemetry data. Rotel is ideal for resource-constrained environments and applications where minimizing overhead is critical.
Can LLMs replace on call SREs today?
https://clickhouse.com/blog/llm-observability-challenge
There's a growing belief that AI-powered observability will soon reduce or even replace the role of Site Reliability Engineers (SREs). That's a bold claim, and at ClickHouse, we were curious to see how close we actually are.
Cloudflare incident on August 21, 2025
https://blog.cloudflare.com/cloudflare-incident-on-august-21-2025
On August 21, 2025, an influx of traffic directed toward clients hosted in the Amazon Web Services (AWS) us-east-1 facility caused severe congestion on links between Cloudflare and AWS us-east-1. This impacted many users who were connecting to or receiving connections from Cloudflare via servers in AWS us-east-1 in the form of high latency, packet loss, and failures to origins.
Customers with origins in AWS us-east-1 began experiencing impact at 16:27 UTC. The impact was substantially reduced by 19:38 UTC, with intermittent latency increases continuing until 20:18 UTC.
This was a regional problem between Cloudflare and AWS us-east-1, and global Cloudflare services were not affected. The degradation in performance was limited to traffic between Cloudflare and AWS us-east-1. The incident was a result of a surge of traffic from a single customer that overloaded Cloudflare's links with AWS us-east-1. It was a network congestion event, not an attack or a BGP hijack.
We’re very sorry for this incident. In this post, we explain what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again.
Pooling Connections with RDS Proxy at Klaviyo
https://klaviyo.tech/pooling-connections-with-rds-proxy-at-klaviyo-e79e04120188
How we scale our databases with RDS Proxy
Availability Models
https://www.thecoder.cafe/p/availability-models
Because “Highly Available” Isn’t Saying Much