DevOps&SRE Library

helm-drift

A Helm plugin that helps identify configuration drift (mostly caused by in-place edits) in deployed Helm charts.

https://github.com/nikhilsbhat/helm-drift

DevOps&SRE Library

loxilb

loxilb is an open-source, hyper-scale software load balancer for cloud-native workloads. It uses eBPF as its core engine and is based on Golang. It is designed to power on-premise, edge and public-cloud Kubernetes cluster deployments.

https://github.com/loxilb-io/loxilb

DevOps&SRE Library

Portless Ports: Demystifying Kubernetes Port Forwarding

https://journal.hexmos.com/kube-network

DevOps&SRE Library

Binding to Low Ports as a Non-root User with Docker and Kubernetes

https://nickjanetakis.com/blog/binding-to-low-ports-as-a-non-root-user-with-docker-and-kubernetes

DevOps&SRE Library

Zero downtime Postgres upgrades

TL;DR: We recently upgraded from Postgres 11.9 to 15.3 with zero downtime by using logical replication, a suite of support scripts, and tools in Elixir & Erlang’s BEAM virtual machine.

This post will go into far too much detail explaining how we did it, and considerations you might need to make along the way if you try to do the same.

It is more of a manual than anything, and includes things we learned along the way that we wish we’d known up front.

https://knock.app/blog/zero-downtime-postgres-upgrades

DevOps&SRE Library

Avoid this mistake when running containerized applications in production

Let's talk about things we must manage when running containerized applications and how this relates to proper management of termination signals.

https://dev.to/antoinecoulon/avoid-this-when-running-containerized-applications-in-production-562k
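
As a rough illustration of what managing termination signals can look like, here is a minimal Python sketch (not taken from the linked article). It assumes the process receives SIGTERM, which is what `docker stop` and Kubernetes send by default before falling back to SIGKILL once the grace period expires. The names `shutdown_requested`, `install_handlers` and `serve` are hypothetical.

```python
import signal
import threading

# Set once the process has been asked to stop.
shutdown_requested = threading.Event()


def handle_termination(signum, frame):
    """Only record the request; the main loop does the actual wind-down."""
    shutdown_requested.set()


def install_handlers():
    # SIGTERM is the default stop signal for containers; SIGINT covers Ctrl+C.
    signal.signal(signal.SIGTERM, handle_termination)
    signal.signal(signal.SIGINT, handle_termination)


def serve(do_work, poll_seconds=0.1):
    """Call do_work until a termination signal arrives; return the iteration count."""
    iterations = 0
    while not shutdown_requested.is_set():
        do_work()
        iterations += 1
        # Sleep between iterations, but wake up early if a signal arrives.
        shutdown_requested.wait(poll_seconds)
    return iterations
```

The handler only sets a flag, so the loop can finish its current unit of work and exit cleanly instead of being killed mid-request. This only helps if the signal actually reaches the application process; a shell wrapper running as PID 1 that does not forward signals is a common reason it never does.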

DevOps&SRE Library

The challenges of configuring Kubernetes resources’ Requests & Limits in combination with HPA at Scale

https://medium.com/@alexandre.highrollers/the-challenges-of-configuring-kubernetes-resources-requests-limits-in-combination-with-hpa-at-92177cb5a378
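
One reason requests and the HPA interact: for CPU utilization targets, the Horizontal Pod Autoscaler measures usage as a percentage of each pod's request, then scales by the ratio of current to target utilization. The Kubernetes documentation gives the formula as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The Python sketch below works through that arithmetic. It is not from the linked article, the function name and numbers are made up for illustration, and it ignores min/max replica bounds, stabilization windows, and pods that are not ready. The 10% tolerance is the documented default.

```python
import math


def desired_replicas(current_replicas, avg_usage_millicores, request_millicores,
                     target_utilization_pct, tolerance=0.1):
    """Simplified HPA arithmetic for a CPU utilization target."""
    # Utilization is measured against the request, not the limit.
    utilization_pct = avg_usage_millicores * 100 / request_millicores
    ratio = utilization_pct / target_utilization_pct
    # Inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

With 4 replicas averaging 450m of CPU and a 60% target, a 500m request means 90% utilization and 6 desired replicas, while a 1000m request means 45% utilization and 3 desired replicas: the same load leads to opposite scaling decisions depending on the request.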

DevOps&SRE Library

Performance Benchmarks of Cloud Machines (December 2023)

In this post, I will compare the performance metrics for different cloud providers. I’ve used standard (shared CPU) instances with 4 vCPUs (RAM may vary) of these providers:

- GitHub Codespaces
- DigitalOcean
- Linode
- Vultr
- Hetzner
- AWS Lightsail
- Google Cloud

https://bas.codes/posts/cloudbench2312

DevOps&SRE Library

atuin

Atuin replaces your existing shell history with a SQLite database, and records additional context for your commands. Additionally, it provides optional and fully encrypted synchronisation of your history between machines, via an Atuin server.

https://github.com/atuinsh/atuin
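
To make the idea concrete, here is a toy Python sketch of shell history kept in SQLite with extra context per command. It is not Atuin's actual schema or code: the table layout, the column names (`cwd`, `exit_code`, `duration_ms`) and the helper functions are assumptions chosen for illustration.

```python
import sqlite3
import time


def open_history(path=":memory:"):
    """Open (or create) a history database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS history (
               id          INTEGER PRIMARY KEY,
               command     TEXT    NOT NULL,
               cwd         TEXT    NOT NULL,
               exit_code   INTEGER NOT NULL,
               duration_ms INTEGER NOT NULL,
               started_at  REAL    NOT NULL
           )"""
    )
    return conn


def record(conn, command, cwd, exit_code, duration_ms, started_at=None):
    """Store one command together with the context it ran in."""
    started_at = time.time() if started_at is None else started_at
    conn.execute(
        "INSERT INTO history (command, cwd, exit_code, duration_ms, started_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (command, cwd, exit_code, duration_ms, started_at),
    )
    conn.commit()


def search(conn, cwd=None, only_successful=False):
    """Return commands, newest first, optionally filtered by directory and success."""
    query = "SELECT command FROM history WHERE 1=1"
    params = []
    if cwd is not None:
        query += " AND cwd = ?"
        params.append(cwd)
    if only_successful:
        query += " AND exit_code = 0"
    query += " ORDER BY started_at DESC, id DESC"
    return [row[0] for row in conn.execute(query, params)]
```

Storing context alongside each command is what allows queries a flat history file cannot answer, such as "commands that succeeded in this directory".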

DevOps&SRE Library

paradedb

ParadeDB is an Elasticsearch alternative built on Postgres. We're building the features of Elasticsearch's product suite, starting with search.

https://github.com/paradedb/paradedb

DevOps&SRE Library

marmot

Marmot is a leaderless, eventually consistent distributed SQLite replicator. It lets you build robust replication between your nodes on top of fault-tolerant NATS JetStream.

So if you are running a read-heavy website based on SQLite, you should be able to scale it out easily by adding more replicated SQLite nodes. SQLite is probably the most ubiquitous database in existence; Marmot aims to make it even more ubiquitous for server-side applications by building a replication layer on top.

https://github.com/maxpert/marmot

DevOps&SRE Library

kamal

From bare metal to cloud VMs, deploy web apps anywhere with zero downtime. Kamal has the dynamic reverse-proxy Traefik hold requests while a new app container is started and the old one is stopped. Works seamlessly across multiple hosts, using SSHKit to execute commands. Originally built for Rails apps, Kamal will work with any type of web app that can be containerized with Docker.

https://github.com/basecamp/kamal

DevOps&SRE Library

Exploring Open Source Alternatives to Terraform Enterprise / Cloud

https://medium.com/terrakube/exploring-open-source-alternatives-to-terraform-enterprise-cloud-73acf158a6e4

DevOps&SRE Library

Building ML Infrastructure with Terraform

https://medium.com/@alexgidiotis_96550/building-ml-infrastructure-with-terraform-520b80874e8b

DevOps&SRE Library

5 SRE Predictions For 2024

1️⃣ Tougher Job Market for SREs

With many companies looking to cut costs due to worsening economic conditions, dedicated SRE roles may be seen as expendable, so SRE headcount and budgets could be reduced. Many organizations are transitioning to an Amazon-like model, where SWEs "do it all". Infrastructure management, operational hardening, incident tracking and being on call are becoming part of the job, so reliability engineers will slowly be pushed out or will have to transition into development. We can already see these trends among colleagues laid off in 2023, including at SRE-minded companies like Google.

This combination of factors means the SRE job market will likely tighten considerably in 2024. Openings will be harder to find and competition will be steeper. SREs will need to clearly demonstrate their value to stay relevant.

2️⃣ Rise of the Hybrid Cloud

The economic realities of running workloads on major public clouds like AWS, GCP and Azure will lead companies to look for alternatives. The costs of using public cloud infrastructure and services have been climbing, eating into budgets. As companies look to reduce spending, running applications on public clouds may no longer make economic sense. We'll see a migration back towards private data centers, colocation facilities, and on-prem infrastructure. SREs skilled in on-prem operations, bare metal provisioning, etc. will be in higher demand.

3️⃣ Kubernetes Will Continue Its Dominance

Although the benefits and operational costs of Kubernetes have been questioned a lot recently, it has become the clear leader as the orchestration platform of choice for containerized workloads. Engineers and companies are heavily invested in Kubernetes workflows and tools, both in the cloud and on-prem. As companies invest further in the efficiency of infrastructure and application management, SREs will need strong Kubernetes expertise.

4️⃣ Increased major outages due to AI-written code (and fewer SREs)

While automated code generation promises improved developer productivity, it also poses new reliability challenges. As code generation by AI systems increases, companies may end up with insufficiently supervised software. With fewer SREs around to establish robust testing and deployment practices, outages caused by bugs in AI-generated code could become more frequent. Companies will be caught off guard by disruptions caused by their overreliance on AI. Quick mitigation of these outages will be difficult as well, since issues in AI-written code are fundamentally harder to fix.

5️⃣ Platform Engineering Matures

In 2024, unifying infrastructure, applications, data, and services under common APIs and self-service platforms will accelerate.

These platforms will provide standardized building blocks and streamlined workflows so engineering teams can quickly build, connect and deploy applications without wasting time on infrastructure complexities. Platforms will handle provisioning, networking, monitoring, access controls, and other operational aspects behind the scenes.

With job opportunities for traditional SRE roles declining, many SREs will look to transition into platform engineering positions. The broad technical skills required by platform roles align well with strengths many SREs already have. However, to successfully land a platform engineering role, you will need to skill up on software development as well. Programming and coding will become mandatory for those looking to get into platform engineering.

https://www.codereliant.io/5-sre-predictions-for-2024

DevOps&SRE Library

Creating an EKS Cluster Using CDKTF

https://medium.com/@stevosjt88/creating-an-eks-cluster-using-cdktf-ed6cf28599c9

DevOps&SRE Library

Best practices to prevent alert fatigue

As your environment changes, new trends can quickly make your existing monitoring less accurate. At the same time, building alerts after every new incident can turn a straightforward strategy into a convoluted one. Treating monitoring as either a one-time effort or a purely reactive one can result in alert fatigue. Alert fatigue occurs when monitoring systems generate an excessive number of alerts, or when alerts are irrelevant or unhelpful, leading to a diminished ability to see critical issues. Updating your alerts too infrequently or too often can cause false positive alarms and redundant alerts that overwhelm your team. A desensitized team won’t be able to detect issues early and will lose trust in its monitoring systems, which can disrupt production and negatively impact your business.

https://www.datadoghq.com/blog/best-practices-to-prevent-alert-fatigue

DevOps&SRE Library

10 Strategies to Build and Manage Scalable Infrastructure

https://spacelift.io/blog/scalable-infrastructure

DevOps&SRE Library

pgxman