DevOps&SRE Library

testkube

Testkube natively integrates test orchestration and execution into Kubernetes and your CI/CD/GitOps pipeline. It decouples test artifacts and execution from CI/CD tooling; tests are meant to be part of your cluster's state and can be executed as needed:

- Manually via the kubectl plugin
- Externally triggered via API (CI, external tooling, etc.)
- Automatically on deployment of annotated/labeled services/pods/etc (WIP)

Testkube advantages:

- Avoids vendor lock-in for test orchestration and execution in CI/CD pipelines
- Makes it easy to orchestrate and run any kind of test (functional, load/performance, security, compliance, etc.) in your clusters, without having to wrap tests in Docker images or provide network access
- Makes it possible to decouple test execution from build processes; engineers should be able to run specific tests whenever needed
- Centralizes all test results in a consistent format for "actionable QA analytics"
- Provides a modular architecture for adding new types of tests and executors

https://github.com/kubeshop/testkube


Staying in the Zone: How DoorDash used a service mesh to manage data transfer, reducing hops and cloud spend

DoorDash has gained many benefits from its evolution from a monolithic application architecture to one based on cells and microservices. The new architecture has reduced the time required for development, test, and deployment, and at the same time has improved scalability and resiliency for end users, including merchants, Dashers, and consumers. As the number of microservices and back-ends has grown, however, DoorDash has observed an uptick in cross-availability zone (AZ) data transfer costs. These data transfer costs, incurred on both send and receive, allow DoorDash to provide its end users a highly available service that can withstand degradations of one or more AZs.

The cost increase prompted our engineering team to investigate alternative ways to provide the same level of service more efficiently. In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality.

https://doordash.engineering/2024/01/16/staying-in-the-zone-how-doordash-used-a-service-mesh-to-manage-data-transfer-reducing-hops-and-cloud-spend


The Single Pain of Glass

How do we create better dashboards?

https://medium.com/site-reliability-engineering-leadership/the-single-pain-of-glass-6e42930e966


The importance of SEV-1 call leaders

Incidents come in different shapes and sizes. The most severe incidents require special handling that is unlike their less-critical variants. These SEV-1 (aka CRITICAL) incidents can have material financial impact for a company and create a challenging environment for any incident commander, creating a need for specially designated SEV-1 call leaders.

https://argoday.medium.com/sev-1-call-leaders-8fdc0ae5f6be


SRE Archetypes

Different hats that SREs wear in the industry: Admin, Architect, Toolsmith, and Firefighter

https://blog.alexewerlof.com/p/sre-archetypes


connect() - why are you so slow?

https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance


A Distributed Systems Reading List

This document contains various resources and quick definitions of a lot of background information behind distributed systems. It is not complete, even though it is kinda sorta detailed. I had written it some time in 2019 when coworkers at the time had asked for a list of references, and I put together what I thought was a decent overview of the basics of distributed systems literature and concepts.

Since I was asked for resources again recently, I decided to pop this text into my blog. I have verified the links again and replaced those that broke with archive links or other ones, but have not sought alternative sources when the old links worked, nor taken the time to add any extra content for new material that may have been published since then.

It is meant to be used as a quick reference to understand various distsys discussions, and to discover the overall space and possibilities that are around this environment.

https://ferd.ca/a-distributed-systems-reading-list.html


Executing Cron Scripts Reliably At Scale

Cron scripts are responsible for critical Slack functionality. They ensure reminders execute on time, email notifications are sent, and databases are cleaned up, among other things. Over the years, both the number of cron scripts and the amount of data these scripts process have increased. While these cron scripts generally executed as expected, over time the reliability of their execution has occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome. These issues led us to design and build a better way to execute cron scripts reliably at scale.

https://slack.engineering/executing-cron-scripts-reliably-at-scale


HashiTalks 2024: Mastering Terraform Testing, a layered approach to testing complex infrastructure

https://mattias.engineer/posts/hashitalks-2024


Anti-patterns of using layers with Terraform

In the ever-evolving landscape of cloud computing and infrastructure management, Terraform has emerged as a transformative force, empowering organizations to define, provision, and manage their infrastructure as code (IaC). The versatility of Terraform extends to its project structuring, offering a myriad of approaches, from modular designs and remote state management to utilizing workspaces and version control, providing users with a spectrum of options to tailor their IaC projects to diverse needs.

In Terraform, the typical organizational structure involves modules, which are reusable units of code that encapsulate infrastructure components. These modules can be composed to create more complex infrastructure. There isn’t a standardized or official concept referred to as “Terraform layers” in the Terraform documentation. However, the term has gained popularity in the Terraform community. If you search for the term “Terraform layers”, various articles will pop up discussing the concept.

https://xebia.com/blog/anti-patterns-of-using-layers-with-terraform


Parsing Terraform for Forms

Transforming Terraform Variable Types into JSON Schema

https://melvinkoh.me/parsing-terraform-for-forms-clr4zq4tu000309juab3r1lf7
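
To make the idea of turning Terraform variable types into JSON Schema concrete, here is a minimal sketch in Python. It is not the linked article's implementation: the function name `tf_type_to_schema` and the mapping table are assumptions made for illustration, and only the primitive types plus `list(...)`, `set(...)`, and `map(...)` are handled. Structural types such as `object({...})` and `tuple([...])` are deliberately rejected.

```python
import re

# Primitive Terraform type constraints and their closest JSON Schema types.
# This table is an illustrative assumption, not taken from the linked article.
PRIMITIVES = {
    "string": {"type": "string"},
    "number": {"type": "number"},
    "bool": {"type": "boolean"},
    "any": {},  # an empty schema accepts any value
}

# Matches a single-argument collection constructor such as list(string).
COLLECTION = re.compile(r"^(list|set|map)\((.*)\)$", re.DOTALL)


def tf_type_to_schema(tf_type: str) -> dict:
    """Convert a Terraform type constraint string into a JSON Schema fragment."""
    tf_type = tf_type.strip()
    if tf_type in PRIMITIVES:
        # Copy so callers cannot mutate the shared table.
        return dict(PRIMITIVES[tf_type])

    match = COLLECTION.match(tf_type)
    if match is None:
        # object(...), tuple(...), and anything else are out of scope here.
        raise ValueError(f"unsupported type constraint: {tf_type!r}")

    kind, inner = match.groups()
    inner_schema = tf_type_to_schema(inner)  # recurse into the element type

    if kind == "map":
        # Terraform maps have string keys and uniformly typed values.
        return {"type": "object", "additionalProperties": inner_schema}

    schema = {"type": "array", "items": inner_schema}
    if kind == "set":
        schema["uniqueItems"] = True
    return schema
```

A form generator could call `tf_type_to_schema("list(map(string))")` for each variable and pass the resulting fragment to any JSON Schema based form renderer.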


Terraform Modules: From Development to Deployment on Gitlab

P0 - Introduction: https://medium.com/@vighnesh_prakash/terraform-modules-from-development-to-deployment-on-gitlab-9191b0aea673

P1 - Publishing Terraform Modules to Gitlab Infra Registry: https://medium.com/@vighnesh_prakash/publishing-terraform-modules-to-gitlab-infra-registry-a52755ebc712

P2 - Just Enough Gitlab: https://medium.com/@vighnesh_prakash/part-02-just-enough-gitlab-3145d0ee56e

P3 - Publishing Terraform Modules using GitLab Pipelines: https://medium.com/@vighnesh_prakash/part-03-publishing-terraform-modules-using-gitlab-pipelines-dc04186472c

P4 - Documenting Terraform Modules: https://medium.com/@vighnesh_prakash/part-04-documenting-terraform-modules-9e284d692d8

P5 - Release Strategy: https://medium.com/@vighnesh_prakash/part-05-release-strategy-80865327bf8d

P6 - Structuring Terraform Modules: https://medium.com/@vighnesh_prakash/part-06-structuring-terraform-modules-77747573c371


SLA vs. SLO vs. SLI: What’s the Difference?

https://uptimerobot.com/blog/sla-slo-sli
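
As a rough illustration of how the three terms relate (using the commonly cited definitions, not anything taken from the linked post): the SLI is a measured ratio, the SLO is an internal target for that ratio, and the SLA is the external, contractual commitment, usually set looser than the SLO. The sketch below computes an SLI and the share of error budget left under an SLO. The function names and the sample numbers in the usage note are hypothetical.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # no traffic means nothing has failed
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the objective as a ratio, for example 0.999 for 99.9%.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad
```

For example, with 1,000,000 requests of which 999,500 succeeded and a 99.9% SLO, `error_budget_remaining(999_500, 1_000_000, 0.999)` reports that roughly half of the budget is still available.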


The Cost Crisis in Observability Tooling

The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying.

In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.

https://www.honeycomb.io/blog/cost-crisis-observability-tooling


You should never be responsible for what you don't control

And the reverse: you should take control of what you are responsible for

https://blog.alexewerlof.com/p/responsible-for-control


How to set a good only one threshold for an alert?

Have you ever asked yourself what a good threshold is for your alert setup?

I have worked on alerting systems for more than 10 years in e-commerce and healthcare. Setting good threshold(s) for an alert is very difficult and contentious.

https://medium.com/production-care/how-to-set-a-good-only-one-threshold-for-an-alert-ddc00c975821


Negotiating Priorities Around Incident Investigations

There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions with multiple stakeholders. The problems I’ll explore in this blog—from the SRE perspective—are about time pressures (when to ship the investigation) and the type of report people expect.

https://www.honeycomb.io/blog/negotiating-priorities-incident-investigations


Our commitment to OpenTelemetry

Prometheus OpenTelemetry support

https://prometheus.io/blog/2024/03/14/commitment-to-opentelemetry


You Might Be Better Off Without Pull Requests

Honestly, pull requests sound like a pretty sweet tool for collaborating on a shared code base. They are a huge success in the open source space, and looking at that success alone it’s not surprising that a lot of teams use a pull request-based process for themselves. On the other hand, there are a lot of voices out there highlighting how using pull requests as the default mechanism for collaboration can slow down your team and prevent you from getting changes into the hands of your users quickly and reliably. Patterns that worked well for low-trust open source communities, they say, didn’t translate well to teams where you know and trust all of your collaborators. Critics of pull requests often suggest alternative workflows that predate pull requests and even git and other distributed version control systems.

https://hamvocke.com/blog/better-off-without-pull-requests


tmate

Tmate is a fork of tmux. It provides an instant pairing solution.

https://github.com/tmate-io/tmate


ingestr

Ingestr is a command-line application that allows you to ingest data from any source into any destination using simple command-line flags, no code necessary.

https://github.com/bruin-data/ingestr
