Executing Cron Scripts Reliably At Scale
https://slack.engineering/executing-cron-scripts-reliably-at-scale
Cron scripts are responsible for critical Slack functionality. They ensure reminders execute on time, email notifications are sent, and databases are cleaned up, among other things. Over the years, both the number of cron scripts and the amount of data these scripts process have increased. While generally these cron scripts executed as expected, over time the reliability of their execution has occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome. These issues led us to design and build a better way to execute cron scripts reliably at scale.
HashiTalks 2024: Mastering Terraform Testing, a layered approach to testing complex infrastructure
https://mattias.engineer/posts/hashitalks-2024
Anti-patterns of using layers with Terraform
https://xebia.com/blog/anti-patterns-of-using-layers-with-terraform
In the ever-evolving landscape of cloud computing and infrastructure management, Terraform has emerged as a transformative force, empowering organizations to define, provision, and manage their infrastructure as code (IaC). The versatility of Terraform extends to its project structuring, offering a myriad of approaches, from modular designs and remote state management to utilizing workspaces and version control, providing users with a spectrum of options to tailor their IaC projects to diverse needs.
In Terraform, the typical organizational structure involves modules, which are reusable units of code that encapsulate infrastructure components. These modules can be composed to create more complex infrastructure. There isn’t a standardized or official concept referred to as “Terraform layers” in the Terraform documentation. However, the term has gained popularity in the Terraform community; if you search for the term “Terraform layers”, various articles discussing the concept will pop up.
Parsing Terraform for Forms
https://melvinkoh.me/parsing-terraform-for-forms-clr4zq4tu000309juab3r1lf7
Transforming Terraform Variable Types into JSON Schema
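The transformation the subtitle describes can be sketched as a small recursive mapping from Terraform type expressions to JSON Schema fragments. This is a simplified illustration covering only a few type constructors, not the post's actual implementation:

```python
# Illustrative sketch: map simple Terraform type expressions to JSON Schema.
# Hypothetical helper, not the article's code; handles only string/number/bool,
# list(...) and map(...).
PRIMITIVES = {"string": "string", "number": "number", "bool": "boolean"}

def terraform_type_to_json_schema(type_expr: str) -> dict:
    """Translate a Terraform type like "list(string)" into a JSON Schema fragment."""
    type_expr = type_expr.strip()
    if type_expr in PRIMITIVES:
        return {"type": PRIMITIVES[type_expr]}
    if type_expr.startswith("list(") and type_expr.endswith(")"):
        # A Terraform list becomes a JSON array with a typed "items" schema.
        return {"type": "array",
                "items": terraform_type_to_json_schema(type_expr[5:-1])}
    if type_expr.startswith("map(") and type_expr.endswith(")"):
        # A Terraform map becomes a JSON object with typed values.
        return {"type": "object",
                "additionalProperties": terraform_type_to_json_schema(type_expr[4:-1])}
    raise ValueError(f"unsupported type expression: {type_expr}")

schema = terraform_type_to_json_schema("list(map(string))")
```

A real implementation would parse the HCL rather than match strings, and would also need `object({...})`, `set(...)`, and defaults, but the recursive shape stays the same.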
Terraform Modules: From Development to Deployment on Gitlab
P0 - Introduction: https://medium.com/@vighnesh_prakash/terraform-modules-from-development-to-deployment-on-gitlab-9191b0aea673
P1 - Publishing Terraform Modules to Gitlab Infra Registry: https://medium.com/@vighnesh_prakash/publishing-terraform-modules-to-gitlab-infra-registry-a52755ebc712
P2 - Just Enough Gitlab: https://medium.com/@vighnesh_prakash/part-02-just-enough-gitlab-3145d0ee56e
P3 - Publishing Terraform Modules using GitLab Pipelines: https://medium.com/@vighnesh_prakash/part-03-publishing-terraform-modules-using-gitlab-pipelines-dc04186472c
P4 - Documenting Terraform Modules: https://medium.com/@vighnesh_prakash/part-04-documenting-terraform-modules-9e284d692d8
P5 - Release Strategy: https://medium.com/@vighnesh_prakash/part-05-release-strategy-80865327bf8d
P6 - Structuring Terraform Modules: https://medium.com/@vighnesh_prakash/part-06-structuring-terraform-modules-77747573c371
The Cost Crisis in Observability Tooling
https://www.honeycomb.io/blog/cost-crisis-observability-tooling
The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying.
In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.
You should never be responsible for what you don't control
https://blog.alexewerlof.com/p/responsible-for-control
And the reverse: you should take control of what you are responsible for
How to set a good only one threshold for an alert?
https://medium.com/production-care/how-to-set-a-good-only-one-threshold-for-an-alert-ddc00c975821
Have you asked yourself what a good threshold for your alert setup is?
I have worked on alerting systems for more than 10 years, in e-commerce and healthcare. Setting good thresholds for an alert is very difficult and contentious.
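One common, data-driven starting point (my assumption, not necessarily the article's recommendation) is to derive the single threshold from a high percentile of historical measurements plus a safety margin, rather than picking a round number:

```python
# Illustrative sketch: derive one alert threshold from historical samples.
# The quantile and margin values here are assumptions, not the article's.
import statistics

def suggest_threshold(samples: list[float], quantile: float = 0.99,
                      margin: float = 1.2) -> float:
    """Return a single threshold: the chosen quantile of historical data,
    widened by a safety margin to reduce false positives."""
    # statistics.quantiles with n=100 yields the 1st..99th percentiles.
    percentiles = statistics.quantiles(samples, n=100)
    return percentiles[int(quantile * 100) - 1] * margin

latencies_ms = [120, 130, 125, 140, 5000, 135, 128, 122, 131, 138]
threshold = suggest_threshold(latencies_ms)
```

In practice the quantile and margin need tuning per signal, and a threshold derived this way should be re-validated as traffic patterns change.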
Negotiating Priorities Around Incident Investigations
https://www.honeycomb.io/blog/negotiating-priorities-incident-investigations
There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions with multiple stakeholders. The problems I’ll explore in this blog—from the SRE perspective—are about time pressures (when to ship the investigation) and the type of report people expect.
Our commitment to OpenTelemetry
https://prometheus.io/blog/2024/03/14/commitment-to-opentelemetry
Prometheus OpenTelemetry support
You Might Be Better Off Without Pull Requests
https://hamvocke.com/blog/better-off-without-pull-requests
Honestly, pull requests sound like a pretty sweet tool for collaborating on a shared code base. They are a huge success in the open source space, and looking at that success alone it’s not surprising that a lot of teams use a pull request-based process for themselves. On the other hand, there are a lot of voices out there highlighting how using pull requests as the default mechanism for collaboration can slow down your team and prevent you from getting changes into the hands of your users quickly and reliably. Patterns that worked well for low-trust open source communities, they say, didn’t translate well to teams where you know and trust all of your collaborators. Critics of pull requests often suggest alternative workflows that predate pull requests and even git and other distributed version control systems.
tmate
https://github.com/tmate-io/tmate
Tmate is a fork of tmux. It provides an instant pairing solution.
ingestr
https://github.com/bruin-data/ingestr
Ingestr is a command-line application that allows you to ingest data from any source into any destination using simple command-line flags, no code necessary.
daytona
https://github.com/daytonaio/daytona
Set up a development environment on any infrastructure, with a single command.
How we avoided alarm fatigue syndrome by managing/reducing the alerting noise
https://medium.com/doctolib/how-we-avoided-alarm-fatigue-syndrome-by-managing-reducing-the-alerting-noise-aac5c008d2e2
GitHub Actions: Terraform deployments with a review of planned changes
https://itnext.io/github-actions-terraform-deployments-with-a-review-of-planned-changes-30143358bb5c
Terraform Strategies for Seamless Grafana Dashboards Across Regions
https://medium.com/tblx-insider/global-products-global-monitoring-terraform-strategies-for-seamless-grafana-dashboards-1e8c2af68512
k8spacket - now fully based on eBPF
https://medium.com/@bareckidarek/k8spacket-a-fully-based-on-ebpf-right-now-e72d5383c743
Measuring Developer Productivity via Humans
https://martinfowler.com/articles/measuring-developer-productivity-humans.html
Measuring developer productivity is a difficult challenge. Conventional metrics focused on development cycle time and throughput are limited, and there aren't obvious answers for where else to turn. Qualitative metrics offer a powerful way to measure and understand developer productivity using data derived from developers themselves. Organizations should prioritize measuring developer productivity using data from humans, rather than data from systems.
How we improved ingester load balancing in Grafana Mimir with spread-minimizing tokens
https://grafana.com/blog/2024/03/07/how-we-improved-ingester-load-balancing-in-grafana-mimir-with-spread-minimizing-tokens
Grafana Mimir is our open source, horizontally scalable, multi-tenant time series database, which allows us to ingest beyond 1 billion active series. Mimir ingesters use consistent hashing, a distributed hashing technique for data replication. This technique guarantees a minimal number of relocations of time series between available ingesters when some ingesters are added or removed from the system.
Unfortunately, we noticed that the consistent hashing algorithm previously used by Mimir ingesters caused an uneven distribution of time series between ingesters, with load distribution differences going up to 25%. As a consequence, some ingesters were overwhelmed, while the others were underused. In order to solve this problem, we came up with a novel algorithm, called spread-minimizing token generation strategy, that allows us to benefit from the consistent hashing on one side and from an almost perfect load distribution on the other side.
Uniform load balancing optimizes network performance and reduces latency as the demand is equally distributed among ingesters. This allows for better usage of compute resources, which leads to more consistent performance. In this blog post, we introduce our new algorithm and show how it improved ingester load balancing in some of our production clusters for Grafana Cloud Metrics (which is powered by Mimir) to the degree that it’s now almost perfect.
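The consistent-hashing idea the excerpt describes can be sketched with a toy token ring. This is a simplified illustration of consistent hashing in general, not Mimir's actual implementation or its spread-minimizing token strategy:

```python
# Toy consistent-hash ring: each ingester owns several pseudo-random tokens
# on a 32-bit ring; a series belongs to the first token clockwise from its hash.
import bisect
import hashlib

def hash32(key: str) -> int:
    # Stable 32-bit hash (illustrative; Mimir uses its own hash function).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

class TokenRing:
    def __init__(self, ingesters, tokens_per_ingester=4):
        self.tokens = sorted(
            (hash32(f"{ingester}/{i}"), ingester)
            for ingester in ingesters
            for i in range(tokens_per_ingester)
        )

    def owner(self, series: str) -> str:
        # First token at or after the series' hash, wrapping around the ring.
        idx = bisect.bisect(self.tokens, (hash32(series), ""))
        return self.tokens[idx % len(self.tokens)][1]

ring = TokenRing(["ingester-1", "ingester-2", "ingester-3"])
owners = {f"series-{i}": ring.owner(f"series-{i}") for i in range(1000)}
```

Adding a fourth ingester relocates only the series whose hashes fall on the new ingester's arcs; every other series keeps its owner, which is the minimal-relocation property the excerpt mentions. What a naive random-token ring like this does not guarantee is an even spread of load across ingesters, and that gap is what the spread-minimizing token strategy addresses.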