awesome-limits
https://github.com/lorin/awesome-limits
Everything has limits, including software systems. When you hit these limits, bad things can happen.
You've probably hit memory and disk limits, but those aren't the only ones.
This page lists limits that, when breached, led to someone having a bad time. I tweeted about limits and got all sorts of interesting responses. This page contains some of them, with links to the tweets, which often contain more details.
Tell me about a time…
https://surfingcomplexity.blog/2023/12/24/tell-me-about-a-time
Here are some proposed questions for interviewing someone for an SRE role. Really, these are just conversation starters to get them reflecting and discussing specific incident details.
Rebuilding Netflix Video Processing Pipeline with Microservices
https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359
10 Tips for Onboarding New SRE Hires
https://www.srepath.com/10-tips-for-onboarding-new-sre-hires
There’s more than one way to mess up your new SRE hire and get them stuck in a loop.
Here are 6 ways new hires will know you’ve made this mistake:
1. unclear role requirements
2. going too advanced too soon
3. not having any tangible, measurable things to do in the first few months
4. not feeling connected with the rest of the SRE team
5. no clarity on how SRE fits into the wider organization
6. little to no collaboration with teams outside of SRE
This article will unpack these 6 sticking points and show how to solve them.
Starting SRE at startups and smaller organizations
https://www.srepath.com/starting-sre-at-startups-and-smaller-organizations
Most of the original thinking behind SRE focuses on implementing it in large-scale systems.
I believe that any organization that has software at the foundation of its core business should at the very least pay attention to SRE principles.
You can always pare hyperscale ideas down to your level of need, which we will explore later in this article.
Ansible vs Terraform: Choose One or Use Both?
https://www.env0.com/blog/ansible-vs-terraform-when-to-choose-one-or-use-them-together
AWS Extended EKS Support: A Costly Band-Aid for Kubernetes Clusters
https://medium.com/@talkimhi/aws-extended-eks-support-a-costly-band-aid-for-kubernetes-clusters-120b8d537abe
Amazon Web Services (AWS) recently announced extended support for Amazon Elastic Kubernetes Service (EKS) versions (starting April 2024), allowing customers to use older versions of Kubernetes for an additional 12 months. While this may seem like a convenient option, it comes with a hefty price tag and several drawbacks that customers should carefully consider before opting for it.
Key metrics for monitoring etcd
https://www.datadoghq.com/blog/etcd-key-metrics
Tools for collecting etcd metrics and logs
https://www.datadoghq.com/blog/etcd-monitoring-tools
testkube
https://github.com/kubeshop/testkube
Testkube natively integrates test orchestration and execution into Kubernetes and your CI/CD/GitOps pipeline. It decouples test artifacts and execution from CI/CD tooling; tests are meant to be part of your cluster's state and can be executed as needed:
- Kubectl plugin
- Externally triggered via API (CI, external tooling, etc)
- Automatically on deployment of annotated/labeled services/pods/etc (WIP)
Testkube advantages:
- Avoids vendor lock-in for test orchestration and execution in CI/CD pipelines
- Makes it easy to orchestrate and run any kind of tests (functional, load/performance, security, compliance, etc.) in your clusters, without having to wrap them in Docker images or provide network access
- Makes it possible to decouple test execution from build processes; engineers should be able to run specific tests whenever needed
- Centralizes all test results in a consistent format for "actionable QA analytics"
- Provides a modular architecture for adding new types of tests and executors
Staying in the Zone: How DoorDash used a service mesh to manage data transfer, reducing hops and cloud spend
https://doordash.engineering/2024/01/16/staying-in-the-zone-how-doordash-used-a-service-mesh-to-manage-data-transfer-reducing-hops-and-cloud-spend
There have been many benefits gained through DoorDash’s evolution from a monolithic application architecture to one that is based on cells and microservices. The new architecture has reduced the time required for development, test, and deployment and at the same time has improved scalability and resiliency for end-users including merchants, Dashers, and consumers. As the number of microservices and back-ends has grown, however, DoorDash has observed an uptick in cross-availability zone (AZ) data transfer costs. These data transfer costs — incurred on both send and receive — allow DoorDash to provide its end users a highly available service that can withstand degradations of one or more AZs.
The cost increase prompted our engineering team to investigate alternative ways to provide the same level of service more efficiently. In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality.
The Single Pain of Glass
https://medium.com/site-reliability-engineering-leadership/the-single-pain-of-glass-6e42930e966
How do we create better dashboards?
The importance of SEV-1 call leaders
https://argoday.medium.com/sev-1-call-leaders-8fdc0ae5f6be
Incidents come in different shapes and sizes. The most severe incidents require special handling that is unlike their less-critical variants. These SEV-1 (aka CRITICAL) incidents can have material financial impact for a company and create a challenging environment for any incident commander, creating a need for specially designated SEV-1 call leaders.
SRE Archetypes
https://blog.alexewerlof.com/p/sre-archetypes
Different hats that SREs wear in the industry: Admin, Architect, Toolsmith, and Firefighter.
connect() - why are you so slow?
https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance
A Distributed Systems Reading List
https://ferd.ca/a-distributed-systems-reading-list.html
This document contains various resources and quick definitions of a lot of background information behind distributed systems. It is not complete, even though it is kinda sorta detailed. I had written it some time in 2019 when coworkers at the time had asked for a list of references, and I put together what I thought was a decent overview of the basics of distributed systems literature and concepts.
Since I was asked for resources again recently, I decided to pop this text into my blog. I have verified the links again and replaced those that broke with archive links or other ones, but have not sought alternative sources when the old links worked, nor taken the time to add any extra content for new material that may have been published since then.
It is meant to be used as a quick reference to understand various distsys discussions, and to discover the overall space and possibilities that are around this environment.
Executing Cron Scripts Reliably At Scale
https://slack.engineering/executing-cron-scripts-reliably-at-scale
Cron scripts are responsible for critical Slack functionality. They ensure reminders execute on time, email notifications are sent, and databases are cleaned up, among other things. Over the years, both the number of cron scripts and the amount of data these scripts process have increased. While generally these cron scripts executed as expected, over time the reliability of their execution has occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome. These issues led us to design and build a better way to execute cron scripts reliably at scale.
HashiTalks 2024: Mastering Terraform Testing, a layered approach to testing complex infrastructure
https://mattias.engineer/posts/hashitalks-2024
Anti-patterns of using layers with Terraform
https://xebia.com/blog/anti-patterns-of-using-layers-with-terraform
In the ever-evolving landscape of cloud computing and infrastructure management, Terraform has emerged as a transformative force, empowering organizations to define, provision, and manage their infrastructure as code (IaC). The versatility of Terraform extends to its project structuring, offering a myriad of approaches, from modular designs and remote state management to utilizing workspaces and version control, providing users with a spectrum of options to tailor their IaC projects to diverse needs.
In Terraform, the typical organizational structure involves modules, which are reusable units of code that encapsulate infrastructure components. These modules can be composed to create more complex infrastructure. There isn’t a standardized or official concept referred to as “Terraform layers” in the Terraform documentation. However, the term has gained popularity in the Terraform community. If you search for the term “Terraform layers”, various articles will pop up discussing the concept.
Parsing Terraform for Forms
https://melvinkoh.me/parsing-terraform-for-forms-clr4zq4tu000309juab3r1lf7
Transforming Terraform Variable Types into JSON Schema
Terraform Modules: From Development to Deployment on Gitlab
P0 - Introduction: https://medium.com/@vighnesh_prakash/terraform-modules-from-development-to-deployment-on-gitlab-9191b0aea673
P1 - Publishing Terraform Modules to Gitlab Infra Registry: https://medium.com/@vighnesh_prakash/publishing-terraform-modules-to-gitlab-infra-registry-a52755ebc712
P2 - Just Enough Gitlab: https://medium.com/@vighnesh_prakash/part-02-just-enough-gitlab-3145d0ee56e
P3 - Publishing Terraform Modules using GitLab Pipelines: https://medium.com/@vighnesh_prakash/part-03-publishing-terraform-modules-using-gitlab-pipelines-dc04186472c
P4 - Documenting Terraform Modules: https://medium.com/@vighnesh_prakash/part-04-documenting-terraform-modules-9e284d692d8
P5 - Release Strategy: https://medium.com/@vighnesh_prakash/part-05-release-strategy-80865327bf8d
P6 - Structuring Terraform Modules: https://medium.com/@vighnesh_prakash/part-06-structuring-terraform-modules-77747573c371