excalidraw
https://github.com/excalidraw/excalidraw
An open-source virtual whiteboard with a hand-drawn style.
How Figma’s databases team lived to tell the scale
https://www.figma.com/blog/how-figmas-databases-team-lived-to-tell-the-scale
Our nine-month journey to horizontally shard Figma’s Postgres stack, and the key to unlocking (nearly) infinite scalability.
Figma’s database stack has grown almost 100x since 2020. This is a good problem to have because it means our business is expanding, but it also poses some tricky technical challenges. Over the past four years, we’ve made a significant effort to stay ahead of the curve and avoid potential growing pains. In 2020, we were running a single Postgres database hosted on AWS’s largest physical instance, and by the end of 2022, we had built out a distributed architecture with caching, read replicas, and a dozen vertically partitioned databases. We split groups of related tables—like “Figma files” or “Organizations”—into their own vertical partitions, which allowed us to make incremental scaling gains and maintain enough runway to stay ahead of our growth.
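As a rough illustration of what horizontal sharding means in practice, here is a minimal Python sketch that routes rows to shards by hashing a shard key. It is not Figma's implementation: the shard count, the example keys, and the `shard_for` and `group_by_shard` helpers are all hypothetical and chosen only for illustration.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen for illustration


def shard_for(shard_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard index using a stable hash."""
    # Python's built-in hash() is salted per process, so use a stable digest
    # to make sure the same key always lands on the same shard.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(keys: list[str], num_shards: int = NUM_SHARDS) -> dict[int, list[str]]:
    """Group keys by the shard index that would own them."""
    groups: dict[int, list[str]] = {}
    for key in keys:
        groups.setdefault(shard_for(key, num_shards), []).append(key)
    return groups


if __name__ == "__main__":
    # Hypothetical file identifiers used as shard keys.
    print(group_by_shard([f"file-{i}" for i in range(10)]))
```

Simple modulo hashing like this forces most keys to move when the shard count changes, which is one reason real systems often add a layer of indirection between keys and physical databases.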
garnet
https://github.com/microsoft/garnet
Garnet is a remote cache-store from Microsoft Research that offers strong performance (throughput and latency), scalability, storage, recovery, cluster sharding, key migration, and replication features. Garnet can work with existing Redis clients.
Fine-grained RBAC for GitHub Action workflows with GitHub OIDC and HashiCorp Vault
https://www.digitalocean.com/blog/fine-grained-rbac-for-github-action-workflows-hashicorp-vault
Properly Running Kubernetes Jobs with Sidecars in 2024 (K8s 1.28+)
https://medium.com/teamsnap-engineering/properly-running-kubernetes-jobs-with-sidecars-in-2024-k8s-1-28-ad9b51d17d50
Kubernetes has been a great orchestrator of Jobs and CronJobs for over half a decade now, but if you had a need for running proxy containers or other secondary containers alongside the job, running things properly took a bit of work and decision-making to handle gracefully.
This article introduces the easiest way to run Jobs with sidecars using the latest Kubernetes features, and has a complementary repository with complete example manifests you can try in your own cluster. The repository contains all the examples for earlier versions of K8s as well, so make sure to focus on the cronjob.sidecar.*.yaml examples.
Best practices for monitoring software testing in CI/CD
https://www.datadoghq.com/blog/best-practices-for-monitoring-software-testing
A key challenge of monitoring your CI/CD system is understanding how to optimize your workflows and create best practices that help you minimize pipeline slowdowns and better respond to CI issues. In addition to monitoring CI pipelines and their underlying infrastructure, your organization also needs to cultivate effective relationships between platform and development teams. Fostering collaboration between these two teams is a critical and equally valuable aspect of improving the reliability and performance of your CI.
In this post, we’ll explore how platform teams can help developers visualize trends in CI test performance and notify them of new flaky tests, test failures, and performance regressions with dashboards and monitors. We’ll also detail best practices that can help developers identify, investigate, and remediate flaky tests.
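One common way to spot flaky tests is to look for a test that both passed and failed on the same commit. The Python sketch below works under that assumption; it is not Datadog's detection logic, and the `CiRun` record and `find_flaky_tests` helper are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class CiRun(NamedTuple):
    """One execution of a test in CI (hypothetical record shape)."""
    test_name: str
    commit_sha: str
    passed: bool


def find_flaky_tests(runs: Iterable[CiRun]) -> set[str]:
    """Return tests that both passed and failed on the same commit."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for run in runs:
        outcomes[(run.test_name, run.commit_sha)].add(run.passed)
    # A test is treated as flaky if one commit saw both outcomes.
    return {name for (name, _sha), seen in outcomes.items() if len(seen) == 2}


if __name__ == "__main__":
    history = [
        CiRun("test_login", "abc123", True),
        CiRun("test_login", "abc123", False),   # same commit, different result
        CiRun("test_checkout", "abc123", False),
        CiRun("test_checkout", "def456", True), # fixed by a later commit
    ]
    print(find_flaky_tests(history))
```

Here `test_checkout` is not flagged, because its result changed only between commits, which looks like an ordinary regression and fix.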
Documentation as code: Principles, workflow, and challenges
https://www.tabnine.com/blog/documentation-as-code-principles-workflow-and-challenges
Core principles of documentation-as-code tools
- Treating documentation with the same rigor as code
- Storing documentation in version control
- Automation of documentation generation and deployment
- Peer review processes for documentation updates
Service Level Agreement
https://blog.alexewerlof.com/p/sla
Introduction to the SLA in relation to SLI and SLO
How to deal with alert fatigue head-on
https://incident.io/hubs/on-call/dealing-with-alert-fatigue-head-on
Everyone experiences stress at work—thankfully, it’s a topic folks aren’t shying away from anymore.
But for on-call engineers, alert fatigue is a phenomenon closer to home. Unfortunately, like stress, it can be just as insidious and drastically impact those it affects.
First discussed in the context of hospital settings, this phrase later entered engineering circles. Alert fatigue occurs when an excessive number of alerts overwhelms the people responsible for answering them, often over a prolonged period, so that alerts are missed, answered late, or ignored altogether.
This fatigue reaches beyond the individual and can create significant risks for your organization.
But if you approach on-call the right way, you can mitigate the impacts of alert fatigue or, better yet, avoid it altogether. Here, we'll dive into the tactics teams can implement to address alert fatigue and its underlying causes.
Different Ways to Aggregate Nines
https://hross.substack.com/p/different-ways-to-aggregate-nines
While working on SLOs, SLAs and SLIs I have found that there are only so many ways to aggregate service metrics. I have not yet found a resource that reviews the different aggregation methods and their relative strengths and weaknesses.
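To show why the choice of aggregation matters, the Python sketch below compares two simple ways of combining per-service availability: an unweighted mean of ratios and a request-weighted ratio. These two methods are an assumption for illustration and are not necessarily the ones the article reviews; the function names and sample numbers are hypothetical.

```python
import math

# Each service is a (good_requests, total_requests) pair.
Service = tuple[int, int]


def nines(availability: float) -> float:
    """Convert an availability ratio (e.g. 0.999) to a count of nines (about 3)."""
    if not 0.0 <= availability < 1.0:
        raise ValueError("availability must be in [0, 1)")
    return -math.log10(1.0 - availability)


def mean_availability(services: list[Service]) -> float:
    """Unweighted mean: every service counts equally, whatever its traffic."""
    ratios = [good / total for good, total in services]
    return sum(ratios) / len(ratios)


def weighted_availability(services: list[Service]) -> float:
    """Request-weighted: every request counts equally across all services."""
    good = sum(g for g, _ in services)
    total = sum(t for _, t in services)
    return good / total


if __name__ == "__main__":
    # Hypothetical data: a small healthy service and a large degraded one.
    services = [(999, 1_000), (900_000, 1_000_000)]
    print(mean_availability(services), weighted_availability(services))
```

With these sample numbers, the unweighted mean hides how much traffic the degraded service carries, while the request-weighted figure is dominated by it.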
Distributed Tracing: A Whistle Stop Tour
https://metoro.io/blog/distributed-tracing-whistle-stop-tour
Know enough to be dangerous in 10 minutes
spqr
https://github.com/pg-sharding/spqr
SPQR is a production-ready system for horizontal scaling of PostgreSQL via sharding. We appreciate any kind of feedback and contribution to the project.
Grafana Loki: Optimising log based metrics
https://dev.to/siddharthjain1715/grafana-loki-optimising-log-based-metrics-5edb
Loki's performance can be improved and fine-tuned at multiple layers. From optimising the query and channeling it efficiently for processing to allocating the right computational resources, we will cover the parameters that significantly improve performance.
Is GitOps actually useful?
https://medium.com/@briankgrant/is-gitops-actually-useful-a1c851ba99d8
GitOps doesn’t solve all deployment problems or even cover the entire deployment process, but it’s a solid foundational building block.
Automation using Control planes vs. Command-line tools
https://medium.com/@briankgrant/automation-using-control-planes-vs-command-line-tools-66f818ff8278
Monorepos vs. many repos: is there a good answer?
https://medium.com/@briankgrant/monorepos-vs-many-repos-is-there-a-good-answer-9bac102971da
The Technical History of Kubernetes
https://medium.com/@briankgrant/the-technical-history-of-kubernetes-2fe1988b522a
rotz
https://github.com/volllly/rotz
Fully cross-platform dotfile manager and dev environment bootstrapper written in Rust.
Moving fast breaks things: the importance of a staging environment
https://graphite.dev/blog/staging-environment
SLO formulas implementation in PromQL step by step
https://mkaz.me/blog/2024/slo-formulas-implementation-in-promql-step-by-step