DevOps&SRE Library

Two-node HA Kubernetes for edge computing cost savings

Our 2-node HA architecture uses Spectro Cloud’s existing, battle-tested edge solution, which builds upon open source components including Kairos, K3s, kube-vip, Harbor, and system-upgrade-controller.

https://itnext.io/two-node-ha-kubernetes-for-edge-computing-cost-savings-9a009eb076ac

Talos - An Immutable OS for Kubernetes

For some time now, I have been interested in Talos, an operating system for Kubernetes. I installed my first Talos cluster in November 2023, and my “production” (composed of 3 Raspberry Pis) is now running on this OS.

https://a-cup-of.coffee/blog/talos

Automating Deployments with FluxCD in AKS

https://gagovictor.medium.com/automating-deployments-with-fluxcd-in-aks-60c3814502bf

multus-cni

Multus CNI enables attaching multiple network interfaces to pods in Kubernetes.

https://github.com/k8snetworkplumbingwg/multus-cni

kube-startup-cpu-boost

Kube Startup CPU Boost is a controller that increases CPU resource requests and limits during Kubernetes workload startup time. Once the workload is up and running, the resources are set back to their original values.

https://github.com/google/kube-startup-cpu-boost

rbac-wizard

RBAC Wizard is a tool that helps you visualize and analyze the RBAC configurations of your Kubernetes cluster. It provides a graphical representation of the Kubernetes RBAC objects.

https://github.com/pehlicd/rbac-wizard

kubediff

Source VS Deployed

https://github.com/Ramilito/kubediff

cluster-template

A template for deploying a Talos Kubernetes cluster including Flux for GitOps

https://github.com/onedr0p/cluster-template

Hot Take: Don't provide incident resolution estimates

https://firehydrant.com/blog/hot-take-dont-provide-incident-resolution-estimates

Continuous reinvention: A brief history of block storage at AWS

Marc Olson has been part of the team shaping Elastic Block Store (EBS) for over a decade. In that time, he’s helped to drive the dramatic evolution of EBS from a simple block storage service relying on shared drives to a massive network storage system that delivers over 140 trillion daily operations.

In this post, Marc provides a fascinating insider’s perspective on the journey of EBS. He shares hard-won lessons in areas such as queueing theory, the importance of comprehensive instrumentation, and the value of incrementalism versus radical changes. Most importantly, he emphasizes how constraints can often breed creative solutions. It’s an insightful look at how one of AWS’s foundational services has evolved to meet the needs of our customers (and the pace at which they’re innovating).

https://www.allthingsdistributed.com/2024/08/continuous-reinvention-a-brief-history-of-block-storage-at-aws.html

Why I don’t like discussing action items during incident reviews

https://surfingcomplexity.blog/2024/09/28/why-i-dont-like-discussing-action-items-during-incident-reviews

Syncing PagerDuty Schedules to Slack Groups

https://www.honeycomb.io/blog/syncing-pagerduty-schedules-slack-groups

docmost

Open-source collaborative wiki and documentation software.

https://github.com/docmost/docmost

octopod

A UI for Docker Registries

https://github.com/frectonz/octopod

A Look at the New Prometheus 3.0 UI

The Prometheus Team has just announced Prometheus version 3.0 at PromCon, with an official blog post detailing all the exciting new changes and features. A very visible highlight of Prometheus 3.0 is its new web UI that is enabled by default.

I've worked on this complete UI rewrite over the last half year or so and am excited to finally see it in the hands of users. In this post, I'll take a closer look at my motivation for building the new UI and then explain what it brings in terms of features and stability caveats.

https://promlabs.com/blog/2024/09/11/a-look-at-the-new-prometheus-3-0-ui

hoop

Secure, seamless access to databases and servers.

https://github.com/hoophq/hoop

10 Examples Why cURL is an Awesome CLI Tool

Whether you're a developer, DevOps engineer, SysAdmin, QA or in any other technical role, you're surely familiar with cURL - the command line tool and library for transferring data with URLs (as described in the docs).

Most of the time, however, we only use curl for simple tasks, such as downloading a file or checking if a website is accessible, yet there's so much more curl can do!

In this article we will go through exactly those cool examples and tricks to showcase why curl is an awesome and underappreciated tool...

https://martinheinz.dev/blog/113

Why I like discussing actions items in incident reviews

https://incident.io/blog/why-i-like-discussing-actions-items-in-incident-reviews

A Comprehensive Guide to Database Sharding: Building Scalable Systems

Explore an in-depth guide to database sharding: what it is, its types, how to select shard keys, and route queries for building scalable systems.

https://dzone.com/articles/a-comprehensive-guide-to-database-sharding
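
To make the shard-key and query-routing idea concrete, here is a minimal sketch of hash-based routing in Python. It is not taken from the linked article; the shard names and the `shard_for` and `group_by_shard` helpers are hypothetical and chosen only for illustration.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to connection strings.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(shard_key: str, shards: list[str] = SHARDS) -> str:
    """Route a shard key to a shard using a stable hash modulo the shard count."""
    # hashlib yields the same digest in every process, unlike the built-in hash(),
    # which is randomized per interpreter run for strings.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def group_by_shard(keys: list[str]) -> dict[str, list[str]]:
    """Group keys by target shard so a multi-key query touches each shard once."""
    routed: dict[str, list[str]] = {}
    for key in keys:
        routed.setdefault(shard_for(key), []).append(key)
    return routed
```

Modulo hashing is the simplest option, but changing the number of shards remaps most keys; consistent hashing or a lookup directory are the usual alternatives when shards are added often.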

Improving platform resilience at Cloudflare through automation

Failure is an expected state in production systems, and no predictable failure of either software or hardware components should result in a negative experience for users. The exact failure mode may vary, but certain remediation steps must be taken after detection. A common example is when an error occurs on a server, rendering it unfit for production workloads, and requiring action to recover.

When operating at Cloudflare’s scale, it is important to ensure that our platform is able to recover from faults seamlessly. It can be tempting to rely on the expertise of world-class engineers to remediate these faults, but this would be manual, repetitive, unlikely to produce enduring value, and unable to scale. In one word: toil, which is not a viable solution at our scale and rate of growth.

In this post we discuss how we built the foundations to enable a more scalable future, and what problems it has immediately allowed us to solve.

https://blog.cloudflare.com/improving-platform-resilience-at-cloudflare

dockcheck

CLI tool to automate docker image updates.

https://github.com/mag37/dockcheck
