helm-drift
https://github.com/nikhilsbhat/helm-drift
A Helm plugin that helps identify configuration drift (mostly caused by in-place edits) in deployed Helm charts.
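Conceptually, drift detection compares what a chart rendered with what is actually running in the cluster. The Python sketch below illustrates that comparison on plain dictionaries. It is not helm-drift's implementation: `find_drift` is a hypothetical helper, and the sketch assumes both sides have already been parsed from YAML into nested dicts.

```python
from typing import Any


def find_drift(desired: Any, live: Any, path: str = "") -> list[str]:
    """Report fields whose live value differs from the rendered (desired) value.

    Keys that exist only in the live object are ignored, because the API server
    adds defaults and status fields that were never part of the chart.
    """
    if isinstance(desired, dict) and isinstance(live, dict):
        drifts: list[str] = []
        for key, want in desired.items():
            sub = f"{path}.{key}" if path else str(key)
            if key not in live:
                drifts.append(f"{sub}: missing in live object")
            else:
                drifts.extend(find_drift(want, live[key], sub))
        return drifts
    # Leaf values (and whole lists) are compared as a unit.
    if desired != live:
        return [f"{path}: {desired!r} -> {live!r}"]
    return []
```

In practice the desired side could come from `helm get manifest <release>` and the live side from the API server; a real tool also has to deal with defaulted fields and list ordering, which this sketch ignores.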
loxilb
https://github.com/loxilb-io/loxilb
loxilb is an open source hyper-scale software load-balancer for cloud-native workloads. It uses eBPF as its core engine and is written in Go. It is designed to power on-premises, edge and public-cloud Kubernetes cluster deployments.
Binding to Low Ports as a Non-root User with Docker and Kubernetes
https://nickjanetakis.com/blog/binding-to-low-ports-as-a-non-root-user-with-docker-and-kubernetes
Zero downtime Postgres upgrades
https://knock.app/blog/zero-downtime-postgres-upgrades
Tl;dr: We recently upgraded from Postgres 11.9 to 15.3 with zero downtime by using logical replication, a suite of support scripts, and tools in Elixir & Erlang’s BEAM virtual machine.
This post will go into far too much detail explaining how we did it, and considerations you might need to make along the way if you try to do the same.
It is more of a manual than anything, and includes things we learned along the way that we wish we’d known up front.
Avoid this mistake when running containerized applications in production
https://dev.to/antoinecoulon/avoid-this-when-running-containerized-applications-in-production-562k
Let's talk about things we must manage when running containerized applications and how this relates to proper management of termination signals.
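When a container is stopped, `docker stop` and Kubernetes send SIGTERM to the container's main process and follow up with SIGKILL if it has not exited within the grace period. The Python sketch below shows a process that handles SIGTERM and shuts down cleanly. It is a generic illustration rather than code from the article, and `run`, `work` and `cleanup` are hypothetical names.

```python
import signal
import threading

# Set by the signal handler and checked by the work loop.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only record the request; cleanup happens in run()."""
    shutdown_requested.set()


# SIGTERM is what container runtimes send first; SIGINT covers Ctrl-C locally.
signal.signal(signal.SIGTERM, request_shutdown)
signal.signal(signal.SIGINT, request_shutdown)


def run(work, cleanup, interval=0.5):
    """Call `work` repeatedly until shutdown is requested, then `cleanup` once."""
    iterations = 0
    while not shutdown_requested.is_set():
        work()
        iterations += 1
        # wait() returns early once the event is set, so shutdown is not delayed.
        shutdown_requested.wait(interval)
    cleanup()  # e.g. finish in-flight requests, close database connections
    return iterations
```

The handler deliberately does almost nothing. Note that this code only sees the signal if the Python process actually receives it, for example when it runs as PID 1 via the exec form of `CMD` rather than behind a `/bin/sh -c` wrapper that may not forward signals.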
The challenges of configuring Kubernetes resources’ Requests & Limits in combination with HPA at Scale
https://medium.com/@alexandre.highrollers/the-challenges-of-configuring-kubernetes-resources-requests-limits-in-combination-with-hpa-at-92177cb5a378
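One reason requests matter so much for HPA is that the Horizontal Pod Autoscaler measures resource utilization as a percentage of the pod's *request*, and the Kubernetes documentation gives the scaling rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The Python sketch below reproduces that rule with the default 10% tolerance to show how the same load scales differently depending on the request. It omits min/max replica bounds, stabilization windows and the handling of not-ready pods.

```python
import math


def cpu_utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    """Utilization as HPA sees it: relative to the CPU request, not the limit."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float, tolerance: float = 0.1) -> int:
    """desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_util / target_util
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For example, four pods each using 300m of CPU against a 60% target: with a 250m request utilization is 120% and the sketch returns 8 replicas, while with a 500m request utilization is 60% and the deployment stays at 4.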
Performance Benchmarks of Cloud Machines (December 2023)
https://bas.codes/posts/cloudbench2312
In this post, I will compare the performance metrics for different cloud providers. I’ve used standard (shared CPU) instances with 4 vCPUs (RAM may vary) from these providers:
- GitHub Codespaces
- DigitalOcean
- Linode
- Vultr
- Hetzner
- AWS LightSail
- Google Cloud
atuin
https://github.com/atuinsh/atuin
Atuin replaces your existing shell history with a SQLite database, and records additional context for your commands. Additionally, it provides optional and fully encrypted synchronisation of your history between machines, via an Atuin server.
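To make "records additional context" concrete, here is a small Python `sqlite3` sketch of a history table that stores the working directory, exit code and duration alongside each command and uses them when searching. The schema and function names are hypothetical and are not Atuin's actual schema; sync and encryption are left out.

```python
import sqlite3
import time


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS history (
               id INTEGER PRIMARY KEY,
               timestamp REAL NOT NULL,
               command TEXT NOT NULL,
               cwd TEXT NOT NULL,
               exit_code INTEGER,
               duration_ms INTEGER,
               hostname TEXT
           )"""
    )
    return conn


def record(conn, command, cwd, exit_code, duration_ms, hostname="localhost", ts=None):
    """Store one executed command together with its context."""
    conn.execute(
        "INSERT INTO history (timestamp, command, cwd, exit_code, duration_ms, hostname)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (time.time() if ts is None else ts, command, cwd, exit_code, duration_ms, hostname),
    )
    conn.commit()


def search(conn, text, cwd=None, only_successful=False):
    """Newest-first substring search, optionally filtered by directory and exit code."""
    sql = "SELECT command FROM history WHERE command LIKE ?"
    params = [f"%{text}%"]
    if cwd is not None:
        sql += " AND cwd = ?"
        params.append(cwd)
    if only_successful:
        sql += " AND exit_code = 0"
    sql += " ORDER BY timestamp DESC"
    return [row[0] for row in conn.execute(sql, params)]
```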
paradedb
https://github.com/paradedb/paradedb
ParadeDB is an Elasticsearch alternative built on Postgres. We're building the features of Elasticsearch's product suite, starting with search.
marmot
https://github.com/maxpert/marmot
Marmot is a distributed SQLite replicator with leaderless, eventually consistent replication. It lets you build robust replication between your nodes on top of fault-tolerant NATS JetStream.
So if you are running a read-heavy website based on SQLite, you should be able to scale it out easily by adding more replicated SQLite nodes. SQLite is probably the most ubiquitous database, present almost everywhere; Marmot aims to make it even more ubiquitous for server-side applications by building a replication layer on top.
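One generic way to build replication on top of SQLite is trigger-based change capture: triggers append every write to a change-log table, and a replicator publishes unpublished entries to the other nodes (for Marmot, the transport is NATS JetStream), which replay them. The Python sketch below shows that technique only; it is not Marmot's code, and the `posts` and `change_log` tables and all function names are hypothetical.

```python
import sqlite3

POSTS_DDL = "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL);\n"

# Hypothetical change log plus triggers that record every write to `posts`.
SCHEMA = POSTS_DDL + """
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    tbl TEXT NOT NULL,
    op TEXT NOT NULL,
    row_id INTEGER NOT NULL,
    title TEXT,
    published INTEGER NOT NULL DEFAULT 0
);
CREATE TRIGGER posts_ins AFTER INSERT ON posts BEGIN
    INSERT INTO change_log (tbl, op, row_id, title) VALUES ('posts', 'insert', NEW.id, NEW.title);
END;
CREATE TRIGGER posts_upd AFTER UPDATE ON posts BEGIN
    INSERT INTO change_log (tbl, op, row_id, title) VALUES ('posts', 'update', NEW.id, NEW.title);
END;
CREATE TRIGGER posts_del AFTER DELETE ON posts BEGIN
    INSERT INTO change_log (tbl, op, row_id, title) VALUES ('posts', 'delete', OLD.id, NULL);
END;
"""


def pending_changes(conn):
    """Changes a replicator would publish next, in commit order."""
    return conn.execute(
        "SELECT seq, op, row_id, title FROM change_log WHERE published = 0 ORDER BY seq"
    ).fetchall()


def mark_published(conn, up_to_seq):
    """Record that every change up to `up_to_seq` has been shipped."""
    conn.execute("UPDATE change_log SET published = 1 WHERE seq <= ?", (up_to_seq,))
    conn.commit()


def apply_change(conn, op, row_id, title):
    """Replay one change on another node; the last change applied wins."""
    if op == "delete":
        conn.execute("DELETE FROM posts WHERE id = ?", (row_id,))
    else:
        conn.execute("INSERT OR REPLACE INTO posts (id, title) VALUES (?, ?)", (row_id, title))
    conn.commit()
```

A real replicator must also keep replayed changes from being captured and re-published by the replica's own triggers, and must order or resolve concurrent writes from different nodes; the sketch sidesteps both by replaying into a replica table that has no triggers.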
kamal
https://github.com/basecamp/kamal
From bare metal to cloud VMs, deploy web apps anywhere with zero downtime. Kamal has the dynamic reverse-proxy Traefik hold requests while a new app container is started and the old one is stopped. Works seamlessly across multiple hosts, using SSHKit to execute commands. Originally built for Rails apps, Kamal will work with any type of web app that can be containerized with Docker.
Exploring Open Source Alternatives to Terraform Enterprise / Cloud
https://medium.com/terrakube/exploring-open-source-alternatives-to-terraform-enterprise-cloud-73acf158a6e4
Building ML Infrastructure with Terraform
https://medium.com/@alexgidiotis_96550/building-ml-infrastructure-with-terraform-520b80874e8b
5 SRE Predictions For 2024
https://www.codereliant.io/5-sre-predictions-for-2024
1️⃣ Tougher Job Market for SREs
With many companies looking to cut costs due to worsening economic conditions, dedicated SRE roles may be seen as expendable, so SRE headcount and budgets could be reduced. Many organizations are moving to an Amazon-like model, where SWEs "do it all": infrastructure management, operational hardening, incident tracking and being on call are becoming part of the job, so reliability engineers will slowly be pushed out or will have to transition into development. We already saw these trends in the 2023 layoffs, including at SRE-minded companies like Google.
This combination of factors means the SRE job market will likely tighten considerably in 2024. Openings will be harder to find and competition will be steeper. SREs will need to clearly demonstrate their value to stay relevant.
2️⃣ Rise of the Hybrid Cloud
The economic realities of running workloads on major public clouds like AWS, GCP and Azure will lead companies to look for alternatives. The costs of using public cloud infrastructure and services have been climbing, eating into budgets. As companies look to reduce spending, running applications on public clouds may no longer make economic sense. We'll see a migration back towards private data centers, colocation facilities, and on-prem infrastructure. SREs skilled in on-prem operations, bare metal provisioning, etc. will be in higher demand.
3️⃣ Kubernetes will continue its dominance
While Kubernetes' benefits and operational costs have been questioned a lot recently, it has become the clear leader among orchestration platforms for containerized workloads. Engineers and companies are heavily invested in Kubernetes workflows and tools, both in the cloud and on-prem. As companies look to further invest in the efficiency of infrastructure and application management, SREs will need strong Kubernetes expertise.
4️⃣ Increased major outages due to AI-written code (and fewer SREs)
While automated code generation promises improved developer productivity, it also poses new reliability challenges. As more code is generated by AI systems, companies may end up with insufficiently supervised software. With fewer SREs around to establish robust testing and deployment practices, outages caused by bugs in AI-generated code could become more frequent. Companies will be caught off guard by disruptions caused by their overreliance on AI. Quickly mitigating these outages would be problematic as well, since it would fundamentally be harder to fix issues in AI-written code.
5️⃣ Platform Engineering Matures
In 2024, unifying infrastructure, applications, data, and services under common APIs and self-service platforms will accelerate.
These platforms will provide standardized building blocks and streamlined workflows so engineering teams can quickly build, connect and deploy applications without getting bogged down in infrastructure complexity. Platforms will handle provisioning, networking, monitoring, access controls, and other operational aspects behind the scenes.
With job opportunities for traditional SRE roles declining, many SREs will look to transition into platform engineering positions. The broad technical skills required by platform roles align well with strengths many SREs already have. However, to successfully land a platform engineering role, you will need to skill up on software development as well. Programming skills will become mandatory for those looking to get into platform engineering.
Creating an EKS Cluster Using CDKTF
https://medium.com/@stevosjt88/creating-an-eks-cluster-using-cdktf-ed6cf28599c9
Best practices to prevent alert fatigue
https://www.datadoghq.com/blog/best-practices-to-prevent-alert-fatigue
As your environment changes, new trends can quickly make your existing monitoring less accurate. At the same time, building alerts after every new incident can turn a straightforward strategy into a convoluted one. Treating monitoring as either a one-time or a purely reactive effort can result in alert fatigue. Alert fatigue occurs when monitoring systems generate an excessive number of alerts, or alerts that are irrelevant or unhelpful, diminishing the team's ability to see critical issues. Updating your alerts too infrequently or too often can cause false-positive alarms and redundant alerts that overwhelm your team. A desensitized team won’t be able to detect issues early and will lose trust in their monitoring systems, which can disrupt production and negatively impact your business.
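A common way to cut false positives is to notify only when a condition persists across several evaluations (the same idea as the `for` duration in a Prometheus alerting rule) and to resolve only after it has stayed healthy for a while. The Python sketch below shows that hysteresis on a stream of samples. The class name and parameters are hypothetical, and this is not Datadog's monitor API.

```python
from collections import deque
from typing import Optional


class SustainedThresholdAlert:
    """Fire after `window` consecutive breaching samples and resolve after
    `window` consecutive healthy ones, so isolated spikes do not page anyone."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.window = window
        self.recent: deque = deque(maxlen=window)  # True = sample breached
        self.firing = False

    def observe(self, value: float) -> Optional[str]:
        """Feed one sample; return "FIRING"/"RESOLVED" on a state change, else None."""
        self.recent.append(value > self.threshold)
        if len(self.recent) < self.window:
            return None  # not enough history yet
        if not self.firing and all(self.recent):
            self.firing = True
            return "FIRING"
        if self.firing and not any(self.recent):
            self.firing = False
            return "RESOLVED"
        return None
```

With `threshold=90, window=3`, the samples `95, 50, 95, 96, 97, 98, 40, 30, 20` produce a single FIRING event at the fifth sample and a single RESOLVED event at the last one, instead of flapping on the isolated first spike.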
10 Strategies to Build and Manage Scalable Infrastructure
https://spacelift.io/blog/scalable-infrastructure
better-commits
https://github.com/Everduin94/better-commits
A CLI for creating better commits following the Conventional Commits specification
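A Conventional Commits header has the shape `<type>[optional scope][!]: <description>`, where `!` flags a breaking change. The Python sketch below validates and parses such a header; it is not how better-commits works internally. Only `feat` and `fix` are defined by the specification, so the wider `ALLOWED_TYPES` set is an assumption borrowed from common convention, and `BREAKING CHANGE:` footers in the commit body are not handled.

```python
import re
from typing import Optional

# feat and fix come from the spec; the other types are a common convention (assumed).
ALLOWED_TYPES = {"feat", "fix", "build", "chore", "ci", "docs",
                 "perf", "refactor", "revert", "style", "test"}

HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)"               # type, e.g. feat or fix
    r"(?:\((?P<scope>[^()\s]+)\))?"    # optional scope in parentheses
    r"(?P<breaking>!)?"                # optional breaking-change marker
    r": (?P<description>\S.*)$",       # colon, one space, non-empty description
    re.IGNORECASE,                     # the spec treats types as case-insensitive
)


def parse_commit_header(header: str) -> Optional[dict]:
    """Return the parsed header fields, or None if the header is not valid."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    commit_type = match.group("type").lower()
    if commit_type not in ALLOWED_TYPES:
        return None
    return {
        "type": commit_type,
        "scope": match.group("scope"),
        "breaking": match.group("breaking") is not None,
        "description": match.group("description"),
    }
```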
Provision EKS Cluster with ArgoCD by Terraform
https://yukccy.medium.com/provision-eks-cluster-with-argocd-by-terraform-4ba07a891463
https://github.com/yukccy/terraform-argocd-on-eks