Two-node HA Kubernetes for edge computing cost savings
https://itnext.io/two-node-ha-kubernetes-for-edge-computing-cost-savings-9a009eb076ac
Our 2-node HA architecture uses Spectro Cloud’s existing, battle-tested edge solution, which builds upon open source components including Kairos, K3s, kube-vip, Harbor, and system-upgrade-controller.
Talos - An Immutable OS for Kubernetes
https://a-cup-of.coffee/blog/talos
For some time now, I have been interested in Talos, an operating system for Kubernetes. I installed my first Talos cluster in November 2023, and my “production” (composed of three Raspberry Pis) is now running on this OS.
Automating Deployments with FluxCD in AKS
https://gagovictor.medium.com/automating-deployments-with-fluxcd-in-aks-60c3814502bf
multus-cni
https://github.com/k8snetworkplumbingwg/multus-cni
Multus CNI enables attaching multiple network interfaces to pods in Kubernetes.
kube-startup-cpu-boost
https://github.com/google/kube-startup-cpu-boost
Kube Startup CPU Boost is a controller that increases CPU resource requests and limits during Kubernetes workload startup time. Once the workload is up and running, the resources are set back to their original values.
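The boost-then-revert behavior described above can be illustrated with a minimal sketch. This is not the controller's code or API: `CpuResources`, `boost`, `resources_for`, and the percentage-based increase are hypothetical names and assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuResources:
    """CPU request and limit in millicores (hypothetical helper type)."""
    request_millicores: int
    limit_millicores: int


def boost(original: CpuResources, percent_increase: int) -> CpuResources:
    """Return the resources raised by the given percentage (assumed policy)."""
    factor = 100 + percent_increase
    return CpuResources(
        request_millicores=original.request_millicores * factor // 100,
        limit_millicores=original.limit_millicores * factor // 100,
    )


def resources_for(original: CpuResources, percent_increase: int, is_ready: bool) -> CpuResources:
    """Boosted values while the workload starts, original values once it is ready."""
    return original if is_ready else boost(original, percent_increase)


if __name__ == "__main__":
    base = CpuResources(request_millicores=500, limit_millicores=1000)
    print(resources_for(base, 50, is_ready=False))  # during startup
    print(resources_for(base, 50, is_ready=True))   # after startup
```

Integer millicores are used here to avoid floating-point rounding; how the real controller decides that startup has finished is outside the scope of this sketch.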
rbac-wizard
https://github.com/pehlicd/rbac-wizard
RBAC Wizard is a tool that helps you visualize and analyze the RBAC configurations of your Kubernetes cluster. It provides a graphical representation of the Kubernetes RBAC objects.
cluster-template
https://github.com/onedr0p/cluster-template
A template for deploying a Talos Kubernetes cluster including Flux for GitOps
Hot Take: Don't provide incident resolution estimates
https://firehydrant.com/blog/hot-take-dont-provide-incident-resolution-estimates
Continuous reinvention: A brief history of block storage at AWS
https://www.allthingsdistributed.com/2024/08/continuous-reinvention-a-brief-history-of-block-storage-at-aws.html
Marc Olson has been part of the team shaping Elastic Block Store (EBS) for over a decade. In that time, he’s helped to drive the dramatic evolution of EBS from a simple block storage service relying on shared drives to a massive network storage system that delivers over 140 trillion daily operations.
In this post, Marc provides a fascinating insider’s perspective on the journey of EBS. He shares hard-won lessons in areas such as queueing theory, the importance of comprehensive instrumentation, and the value of incrementalism versus radical changes. Most importantly, he emphasizes how constraints can often breed creative solutions. It’s an insightful look at how one of AWS’s foundational services has evolved to meet the needs of our customers (and the pace at which they’re innovating).
Why I don’t like discussing action items during incident reviews
https://surfingcomplexity.blog/2024/09/28/why-i-dont-like-discussing-action-items-during-incident-reviews
Syncing PagerDuty Schedules to Slack Groups
https://www.honeycomb.io/blog/syncing-pagerduty-schedules-slack-groups
A Look at the New Prometheus 3.0 UI
https://promlabs.com/blog/2024/09/11/a-look-at-the-new-prometheus-3-0-ui
The Prometheus Team has just announced Prometheus version 3.0 at PromCon, with an official blog post detailing all the exciting new changes and features. A very visible highlight of Prometheus 3.0 is its new web UI that is enabled by default.
I've worked on this complete UI rewrite over the last half year or so and am excited to finally see it in the hands of users. In this post, I'll take a closer look at my motivation for building the new UI and then explain what it brings in terms of features and stability caveats.
10 Examples Why cURL is an Awesome CLI Tool
https://martinheinz.dev/blog/113
Whether you're a developer, DevOps engineer, SysAdmin, QA or in any other technical role, you're surely familiar with cURL - the command line tool and library for transferring data with URLs (as described in docs).
Most of the time, however, we all really only use curl for simple tasks, such as downloading a file or checking if a website is accessible, yet there's so much more curl can do!
In this article we will go through exactly those cool examples and tricks to showcase why curl is an awesome and underappreciated tool...
Why I like discussing actions items in incident reviews
https://incident.io/blog/why-i-like-discussing-actions-items-in-incident-reviews
A Comprehensive Guide to Database Sharding: Building Scalable Systems
https://dzone.com/articles/a-comprehensive-guide-to-database-sharding
Explore an in-depth guide to database sharding: what it is, its types, how to select shard keys, and route queries for building scalable systems.
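As a rough illustration of shard-key selection and query routing, here is a minimal hash-based routing sketch. It is not taken from the linked guide: the shard names, the `shard_for` function, and the choice of SHA-256 with a modulo are assumptions made for this example.

```python
import hashlib

# Hypothetical shard names; a real system would map these to connections.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(shard_key: str, shards: list[str] = SHARDS) -> str:
    """Route a shard key to a shard using a stable hash and a modulo."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and so would not route consistently.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for user_id in ("user-1", "user-2", "user-3"):
        print(user_id, "->", shard_for(user_id))
```

The trade-off with plain modulo routing is that changing the number of shards remaps most keys; schemes such as consistent hashing or a lookup directory are common ways to reduce that movement.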
Improving platform resilience at Cloudflare through automation
https://blog.cloudflare.com/improving-platform-resilience-at-cloudflare
Failure is an expected state in production systems, and no predictable failure of either software or hardware components should result in a negative experience for users. The exact failure mode may vary, but certain remediation steps must be taken after detection. A common example is when an error occurs on a server, rendering it unfit for production workloads, and requiring action to recover.
When operating at Cloudflare’s scale, it is important to ensure that our platform is able to recover from faults seamlessly. It can be tempting to rely on the expertise of world-class engineers to remediate these faults, but this would be manual, repetitive, unlikely to produce enduring value, and unable to scale. In a word: toil, which is not a viable solution at our scale and rate of growth.
In this post we discuss how we built the foundations to enable a more scalable future, and what problems it has immediately allowed us to solve.