Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Free DevOps projects websites

Hi, I approached a couple of "tech influencers" to share this list; however, they have not done it. I don't know what the story behind 'not sharing free resources' is. The only reason I asked them is that they have a larger audience reach. So, I decided to do this myself.

I hope this helps people who are new to the field of DevOps or even experienced people. Some of them don't need a test environment. Please feel free to add if you know more. I will keep updating this post.

P.S. I do not own any of these. If you own any of them and want them removed from this list (for whatever reasons), please do let me know. I will remove them.


Linux

https://linuxupskillchallenge.org/

https://overthewire.org/wargames/


DevOps

https://workshops.aws/

https://kodekloud.com/free-labs

https://sadservers.com/scenarios

https://labs.iximiuz.com/

https://devopsupskillchallenge.com/

https://engineer.kodekloud.com/practice

https://cloudresumechallenge.dev/docs/the-challenge/aws/

https://learngitbranching.js.org/

https://labs.play-with-docker.com/

https://madhuakula.com/kubernetes-goat/

https://github.com/bregman-arie/devops-exercises

https://redd.it/1kudmi2
@r_devops
Where do you store your documentation? Or what tool do you use?

I’m looking for different documentation tools I could use in my organization. From complex technical docs to the simple todos, what do you guys use?

https://redd.it/1kueea3
@r_devops
ELI5: CAP Theorem in System Design

This is a super simple ELI5 explanation of the CAP Theorem. I mainly wrote it because I found that sources online are either not concise or lack important points. I included two system design examples where the CAP Theorem is used to make design decisions. Maybe this is helpful to some of you :-) Here is the repo: https://github.com/LukasNiessen/cap-theorem-explained

## Super simple explanation

C = Consistency = Every user gets the same data
A = Availability = Users can retrieve the data always
P = Partition tolerance = Even if there are network issues, the system still works

Now the CAP Theorem states that in a distributed system, you need to decide whether you want consistency or availability. You cannot have both.

### Questions

And in non-distributed systems? CAP Theorem only applies to distributed systems. If you only have one database, you can totally have both. (Unless that DB server is down, obviously; then you have neither.)

Is this always the case? No, if everything is green, we have both consistency and availability. However, if a server loses internet access, for example, or any other fault occurs, THEN we can have only one of the two: either consistency or availability.

### Example

As I said already, the problem only arises when we have some sort of fault. Let's look at this example.

      US (Master)                 Europe (Replica)
    ┌─────────────┐                ┌─────────────┐
    │             │                │             │
    │  Database   │◄──────────────►│  Database   │
    │   Master    │    Network     │   Replica   │
    │             │  Replication   │             │
    └─────────────┘                └─────────────┘
           │                              │
           │                              │
           ▼                              ▼
      [US Users]                     [EU Users]


Normal operation: Everything works fine. US users write to master, changes replicate to Europe, EU users read consistent data.

Network partition happens: The connection between US and Europe breaks.

      US (Master)                 Europe (Replica)
    ┌─────────────┐                ┌─────────────┐
    │             │    ╳╳╳╳╳╳╳     │             │
    │  Database   │◄────╳╳╳╳╳─────►│  Database   │
    │   Master    │    ╳╳╳╳╳╳╳     │   Replica   │
    │             │    Network     │             │
    └─────────────┘     Fault      └─────────────┘
           │                              │
           │                              │
           ▼                              ▼
      [US Users]                     [EU Users]


Now we have two choices:

Choice 1: Prioritize Consistency (CP)

- EU users get error messages: "Database unavailable"
- Only US users can access the system
- Data stays consistent but availability is lost for EU users

Choice 2: Prioritize Availability (AP)

- EU users can still read/write to the EU replica
- US users continue using the US master
- Both regions work, but data becomes inconsistent (EU might have old data)
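
The two choices can be sketched as a toy replica in Python. This is only a minimal illustration under simplifying assumptions, not how any real database implements it: the `Replica` class, its `mode` flag, and the `demo` helper are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised by a CP replica that refuses to serve during a partition."""


@dataclass
class Replica:
    mode: str                     # "CP" or "AP" (hypothetical setting)
    partitioned: bool = False     # True while the link to the master is down
    data: dict = field(default_factory=dict)

    def replicate(self, key, value):
        # Called on behalf of the master; updates only arrive while the link is up.
        if not self.partitioned:
            self.data[key] = value

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # Choice 1: give up availability to stay consistent.
            raise Unavailable("Database unavailable")
        # Choice 2 (or a healthy link): answer from local, possibly stale, data.
        return self.data.get(key)


def demo(mode):
    replica = Replica(mode=mode)
    replica.replicate("price", 100)   # normal operation
    replica.partitioned = True        # the network fault happens
    replica.replicate("price", 120)   # this update never reaches Europe
    try:
        return replica.read("price")
    except Unavailable as exc:
        return str(exc)


print(demo("CP"))  # EU users get an error
print(demo("AP"))  # EU users get the old value
```

In CP mode the EU read fails with "Database unavailable"; in AP mode it succeeds but returns 100 while the master already holds 120.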

## What are Network Partitions?

Network partitions are when parts of your distributed system can't talk to each other. Think of it like this:

- Your servers are like people in different rooms
- Network partitions are like the doors between rooms getting stuck
- People in each room can still talk to each other, but can't communicate with other rooms

Common causes:

- Internet connection failures
- Router crashes
- Cable cuts
- Data center outages
- Firewall issues

The key thing is: partitions WILL happen. It's not a matter of if, but when.

## The "2 out of 3" Misunderstanding

CAP Theorem is often presented as "pick 2 out of 3." This is wrong.

Partition tolerance is not optional. In distributed systems, network partitions will happen. You can't choose to "not have" partitions - they're a fact of life, like rain or traffic jams... :-)

So our choice is: When a partition happens, do you want Consistency OR Availability?

- CP Systems: When a partition occurs → node stops responding to maintain consistency
- AP Systems: When a partition occurs → node keeps responding but users may get inconsistent data

In other words, it's not "pick 2 out of 3," it's "partitions will happen, so pick C or A."

## System Design Example 1: Video Streaming Service

Scenario: Building Netflix

Decision: Prioritize Availability (AP)

Why? If some users see slightly outdated movie names for a few seconds, it's not a big deal. But if the users cannot watch movies at all, they will be very unhappy.

## System Design Example 2: Flight Booking System

Here, we will not apply the CAP Theorem to the entire system but to parts of the system. So we have two different parts with different priorities:

### Part 1: Flight Search

Scenario: Users browsing and searching for flights

Decision: Prioritize Availability

Why? Users want to browse flights even if prices/availability might be slightly outdated. Better to show approximate results than no results.

### Part 2: Flight Booking

Scenario: User actually purchasing a ticket

Decision: Prioritize Consistency

Why? If we prioritized availability here, we might sell the same seat to two different users. Very bad. We need strong consistency here.
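
To make the double-booking problem concrete, here is a minimal Python sketch. It assumes a single authoritative seat store guarded by a lock; the `SeatInventory` class and its methods are hypothetical names for illustration. In a real distributed setup this role would typically be played by a transaction against the primary database, and the booking would be refused if the primary is unreachable.

```python
import threading


class SeatInventory:
    """Single source of truth for seat ownership (hypothetical sketch)."""

    def __init__(self, seats):
        self._owner = {seat: None for seat in seats}
        self._lock = threading.Lock()

    def book(self, seat, user):
        # Check-and-set under one lock, so two buyers can never both succeed.
        with self._lock:
            if self._owner[seat] is not None:
                return False
            self._owner[seat] = user
            return True

    def owner(self, seat):
        with self._lock:
            return self._owner[seat]


inventory = SeatInventory(["12A"])
print(inventory.book("12A", "alice"))  # first buyer gets the seat
print(inventory.book("12A", "bob"))    # second buyer is rejected
```

The point of the design is that the check and the write happen as one step against one copy of the data. An AP-style system that accepts bookings on both sides of a partition cannot give that guarantee.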

### PS: Architectural Quantum

What I just described, having two different scopes, is the concept of having more than one architecture quantum. There is a lot of interesting stuff online to read about the concept of architecture quanta :-)

https://redd.it/1kufxrm
@r_devops
Quick update: That “I’ll fix your infra in 48 hours” post kinda blew up

Didn’t expect this, but that post got over 220k views, 180+ comments, and around 70 DMs.

Spent the last two weeks helping people fix all kinds of things: weird CI bugs, Terraform headaches, K8s issues, GPU cost blowups… the usual chaos. A few folks just needed a nudge in the right direction, others had full-on dumpster fires.

Out of all that, 12 people offered legit work. I stuck with 3-4 of them, and we’ve been deep in infra stuff for the past couple of weeks. It's honestly been solid.

Here’s the part I need your help with now:

IF YOU’RE DEALING WITH INFRA OR DEVOPS PAIN RIGHT NOW, I’D LOVE TO KNOW WHAT IT IS.
Also curious what tools you’re using daily.
Drop anything, even just a one-liner. It’ll help me see what patterns are popping up across teams.

Still around and still down to help. Let’s keep it going.

https://redd.it/1kuhnxm
@r_devops
What’s one DevOps tool you still don’t fully trust?

I’ll go first: Helm.

I’ve used it in multiple projects, and yeah, it’s powerful—but it always feels like I’m one typo away from chaos. Templating gone wrong, values.yaml overrides not working, random “why is this resource even here” moments…

Same goes for Ansible sometimes—like I blink and it rewrites half my infra.

Do you have a tool like that?
One you use, but always double-check… just in case?

https://redd.it/1kui6os
@r_devops
Saving 50%+ off our $80K cloud monitoring bill cont'd

Checking back in after my last post, which dove into piloting new cloud monitoring infra to tackle my client's ridiculous $80K/month o11y bill.

As planned, we expanded the pilot, getting a ton more services and traffic flowing through the BYOC eBPF/OTEL setup.

The concerns about having to manage the GC stack completely miss the fully-managed point. The stack runs on our infrastructure but is 100% managed by the GC team. There is no tuning or monitoring of ClickHouse on our side; they do it all for us, and that was exactly what happened. We get an endpoint to send data to, and that’s it.

Reality vs. Sales Pitch / "Gotchas": With the BYOC approach, the customer (or my client) is the one paying for the infrastructure, so TCO is more complex (subscription + hosting) and required more back and forth up and down the chain of command. We also had to make sure all the incentives were aligned and that GC could help us optimize the infrastructure and the data stored. In other words, pay for only what we use.

I've yet to put it to the test, but the GC community Slack channels are monitored (though NOT with an enterprise SLA). This is passable for now, and my team will find out more in the coming months.

A few key learnings during and immediately after the migration process:

- The search syntax takes time to wrap our heads around. The docs could be expanded much more.

- Prometheus compatibility was super critical (we missed this completely during the requirements phase), but thankfully PromQL queries converted 1:1.

- Migration tools to convert dashboards & monitors were a nice touch.

OK, TL;DR of everything so far: we saved money by

1. Better data tiering: hot log retention reduced to 7 days, with 90 days of cold storage for compliance.
2. Unified platforms (MELT + RUM, Hybrid eBPF/OTEL)
3. Owning the infra with no management overhead

No questions at this time. I'm going to sign off and enjoy the Memorial Day long weekend.

https://redd.it/1kuh0t1
@r_devops
Hey everyone, I hope this is okay to post here – just looking for a few people to beta test a tool I’m working on.

I’ve been working on a tool that helps businesses get more Google reviews by automating the process of asking for them through simple text templates. It’s a service I’m calling STARSLIFT, and I’d love to get some real-world feedback before fully launching it.

Here’s what it does:

- Automates the process of asking your customers for Google reviews via SMS

- Lets you track reviews and see how fast you’re growing (review velocity)

- Designed for service-based businesses who want more reviews but don’t have time to manually ask

Right now, I’m looking for a few U.S.-based businesses willing to test it completely free. The goal is to see how it works in real-world settings and get feedback on how to improve it.

If you:

- Are a service-based business in the U.S. (think contractors, salons, dog groomers, plumbers, etc.)

- Get at least 5-20 customers a day

- Are interested in trying it out for a few weeks

… I’d love to connect.

As a thank you, you’ll get free access even after the beta ends.

If this sounds interesting, just drop a comment or DM me with:

- What kind of business you have

- How many customers you typically serve in a day

- Whether you’re in the U.S.

I’ll get back to you and set you up! No strings attached – this is just for me to get feedback and for you to (hopefully) get more reviews for your business.

https://redd.it/1kutyuv
@r_devops
What is the best way to learn DevOps?

I am a MERN stack developer (starting my 4th year in IT). The way I learnt MERN was to learn the basics of each part, then watch people build projects and build alongside them. When I didn't understand a piece of code, I would use ChatGPT and document that particular concept. After 1-2 projects, I started building basic stuff.
TL;DR: Learnt the MERN stack through YouTube and AI.
Unfortunately, I can't do the same with DevOps because the concepts are too theoretical, I presume. So is there something you have that will help me learn it?
PS: Sorry for the long description. Thank you for any advice.

https://redd.it/1kuw1sm
@r_devops
🚀 Milestone Unlocked: 2K Stars! 🌟

My Cheat-Sheet Collection just hit 2,000 stars on GitHub!
Huge thanks to everyone who starred, shared, and contributed. Your support keeps this project growing. 🙌

If you haven't checked it out yet — it's a curated collection of high-quality PDF cheat sheets for developers, DevOps engineers, and tech enthusiasts. 📚💻

Feel free to explore, contribute, and share!
#DevOps #CheatSheet #GitHub #OpenSource #Infosec #DevSecOps #Kubernetes #Linux

https://redd.it/1kuxk2d
@r_devops
Using a really long password to SSH into a VPS: is it that bad?

If you generate a password with openssl like this:

    openssl rand -base64 48

    FyRFHjyJIgnl2g4DsDzv49ohmt7IQyKvGpv7UyAKwGLIJalPueMh9fxJVcGOTLsm


and use that to log in to a VPS - is it that bad?

I've checked the generated string here:

https://bitwarden.com/password-strength/#Password-Strength-Testing-Tool

- It says it will take centuries to crack.


In addition, when you enter a wrong password, the hosting company seems to add an artificial delay of a few seconds before it tells you the password is wrong.

I'm sure the hosting company will detect it if someone tries to crack your VM after a dozen failed attempts, and will call you.

I know the proper way of doing this is to create a new user on the VM, disable password login by changing a few files, and add your SSH keys, but compared to the single step of using passwd, it doesn't look (to me) like it would be more secure.

What's the "security" ratio here? Strong password vs SSH keys
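
For the password side of that comparison, the arithmetic can be sketched in Python. This is a rough illustration that assumes the password comes from a cryptographically secure random source, as `openssl rand` does; the helper names `random_password` and `entropy_bits` are made up for this sketch.

```python
import base64
import secrets


def random_password(num_bytes=48):
    # Same idea as `openssl rand -base64 48`: random bytes, base64-encoded.
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def entropy_bits(num_bytes):
    # Each uniformly random byte contributes 8 bits of entropy.
    return num_bytes * 8


password = random_password()
print(len(password))      # 64 characters for 48 random bytes
print(entropy_bits(48))   # 384 bits of entropy
```

This only measures resistance to guessing, which at 384 bits is far beyond brute force. It says nothing about the other differences between the two methods, such as where the secret is stored or what is sent to the server at login.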


https://redd.it/1kuz8kz
@r_devops
Spacebar Counter Using HTML, CSS and JavaScript (Free Source Code) - JV Codes 2025

With the Spacebar Counter, users can interactively count each time they press the spacebar on their keyboard. You can use this tool to check your speed or to enjoy yourself, and in each case, you’ll see a powerful example of how event handling works in JavaScript.

I have released all the source code for free, and I’ve built it using modern structure and best programming habits to enable beginners and developers to learn easily.

Source: Spacebar Counter

https://redd.it/1kuzrzm
@r_devops
🛠️ Building a No-Nonsense DevOps Course – What Would You Want In It?

Hey r/devops,

I’ve been in the DevOps space for a number of years now — led automation efforts, scaled infra, managed CI/CD pipelines, and trained engineers along the way. Now, I’m planning to build a DevOps course — but not just another course.

I want to create something that cuts through the fluff — something grounded in real-world challenges, production lessons, and what it actually takes to succeed in a DevOps role today.

The usual “install Jenkins/K8s and deploy a to-do app” just doesn’t cut it anymore. So here’s what I’m thinking:
• Production-grade examples with real troubleshooting
• Topics like GitOps, FinOps, Platform Engineering, and team workflows
• Focus on mindset: how to think like a DevOps/infra engineer, not just use tools
• Optional deep dives for those who want to go beyond “just enough to deploy”

If you were taking a course like this, what would you want to see?
What’s missing in today’s DevOps content that you wish someone taught properly?

https://redd.it/1kv43zr
@r_devops
Best Docker registry with image housekeeping support

Hi all,

We’re looking to set up a private Docker registry for our company and one of our must-have features is automatic housekeeping — we need to delete old or unused images to manage disk usage effectively.

We use Jenkins for CI/CD, which pushes images frequently, so over time our registry gets cluttered with outdated builds and untagged layers. We'd like a solution that can:

- Run scheduled or on-demand cleanup jobs

- Support retention policies (e.g., keep last N images or delete images older than X days)

- Ideally offer a web UI and/or API for managing images

- Integrate well with Jenkins or at least not get in the way
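
To pin down what such a retention policy means, here is a small Python sketch of the selection logic only. It is not the API of Harbor, Nexus, or any other registry. The function name, its parameters, and the choice to delete an image only when it is both outside the newest N and older than X days are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def images_to_delete(images, keep_last=10, max_age_days=30, now=None):
    """Pick tags that are outside the newest `keep_last` AND older than `max_age_days`.

    `images` is a list of (tag, pushed_at) tuples with timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    newest_first = sorted(images, key=lambda item: item[1], reverse=True)
    candidates = newest_first[keep_last:]  # always protect the newest N
    return [tag for tag, pushed_at in candidates if pushed_at < cutoff]


now = datetime(2025, 5, 26, tzinfo=timezone.utc)
images = [
    ("build-1", now - timedelta(days=60)),
    ("build-2", now - timedelta(days=40)),
    ("build-3", now - timedelta(days=5)),
    ("build-4", now - timedelta(days=1)),
]
print(images_to_delete(images, keep_last=2, max_age_days=30, now=now))
```

With these sample values, `build-3` and `build-4` are protected as the two newest, and `build-2` and `build-1` are selected for deletion because both are older than 30 days.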


We’re currently evaluating Harbor and Nexus, but open to other suggestions too. What are you using in production for this kind of setup? Any pros/cons we should know about?

Thanks!

https://redd.it/1kv5o1v
@r_devops
Transitioning to a DevOps career and the importance of certifications

I have experience in support and some infrastructure (networks and basic Linux). What would be an ideal schedule to follow to make the most of my career transition?



Another question: are certifications like LPI an important requirement when applying for these positions?

https://redd.it/1kv7rku
@r_devops
DevOps Buddy wanted! LeetCode, tech chats, open source & more!

Hey Reddit!

Looking for someone to team up with for DevOps stuff. I wanna get better at LeetCode, chat about cool tech, mess around with open-source projects, and just keep each other motivated.

I'm really into DevOps and trying to learn more about [mention something specific you're into, like Kubernetes or AWS]. LeetCode's on my list to boost my problem-solving.

If you're up for:
* LeetCode sessions: Let's tackle problems and share ideas.
* DevOps talks: Bouncing ideas around, discussing tools, or just complaining about YAML. 😉
* General tech chats: What's new? What's cool?
* Open source fun: Exploring or even contributing.
* Being accountability buddies: Keeping each other on track.

You don't have to be a guru, just enthusiastic about learning. We can link up online (Discord/Telegram, etc.) whenever works.

If this sounds like your jam, hit me up with a comment or a DM! Let's learn together.


https://redd.it/1kv8ryp
@r_devops
How I Automated My Infrastructure with Terraform

Hello everyone!
I wanted to share one of my more... questionable engineering decisions: I Terraformed my entire home network.

I've been managing my Mikrotik setup (router + switches + wireless) with Terraform for about a year now. Everything from VLANs to firewall rules is defined as code and version controlled.

All of the code is available here: https://github.com/mirceanton/mikrotik-terraform/

Why Terraform for networking?
Honestly, because it's the tool I know. When I found out the RouterOS provider existed, I just had to try it. Probably not the most practical approach, but it's been a great learning experience!

The state management situation is... creative. Can't exactly use S3 when you might accidentally terraform your own internet connection away! I ended up going with local state + SOPS encryption + Git. Works, I guess, but it's definitely not textbook.

Oh, and the amount of terraform state mv commands I've run during refactoring... SO many. I can't just destroy and recreate resources because they are, quite literally, my internet connection. I don't think I've ever had to do this much state surgery... even at work.

The whole thing taught me a lot about both Terraform and networking. Sometimes picking an overly complicated approach is the best way to learn!

Made a video about it too, if you're interested, where I go into my setup as well, not just the code: https://youtu.be/86LRoxuU5kg

Anyone else using Terraform in non-conventional ways? Would love to hear about other creative use cases or approaches!

https://redd.it/1kv99c6
@r_devops
Learn by doing

I'm looking to team up with some like-minded individuals who have a basic grasp of various tools and are ready to jump into some exciting projects! I've got a few cool ideas we could start working on together.

If you're interested in collaborating and bringing some of these ideas to life, let's create a Discord server and get started.

https://redd.it/1kvdbhj
@r_devops
Hiring Managers

1) What are some of the skills with the most demand right now and will stay in demand for the next 30 or so years?

2) How is the job market right now for Cloud/DevOps and SRE roles?

https://redd.it/1kvesqr
@r_devops
Bare metal K8s Cluster Inherited


We inherited an infrastructure consisting of 5 physical servers that make up a k8s cluster: one master and four worker nodes. They also allowed workloads to run on the master itself.

It is an ancient installation, and the physical servers have either RAID-0 or a single disk. They used OpenEBS Hostpath for persistent volumes for all the products.

Now, this is a development cluster but it contains important data. We have several small issues to fix, like:

- Migrate the PV to a distributed storage like NFS

- Make backups of relevant data

- Reinstall the servers and have proper RAID-1 ( at least )

We do not have many resources. We do not have ( for now ) a spare server.

We do have an NFS server. We can use that.

What are good options to implement to mitigate the problems we have? Our goal is to reinstall the servers using proper RAID-1 and migrate some PV to NFS so the data is not lost if we lose one node.

I listed some action points:

- Use the NFS, perform backups using Velero

- Migrate the PVs to the NFS storage


At least we would have backups and some safety.

But how could we start with the servers that do not have RAID-1? The very master itself is single disk. How could we reinstall it and bring it back to the cluster?

The ideal would be to reinstall server by server until all of them have RAID-1 ( or RAID-6 ). But how could we start? We have only one master, and the PVs are attached to the nodes themselves.

It would be nice to convert this setup to Proxmox or some other virtualization system, but I think that is a second step.

Thanks!

https://redd.it/1kvdnb3
@r_devops
Scaling Postgres with Kubernetes: a guide on partitioning, sharding, and replication

I have written a guide on setting up a high-availability Postgres cluster with sharding, replication, and partitioning. Hope you find this helpful. 🐘



https://blog.sagyamthapa.com.np/scaling-postgresql-with-kubernetes

https://redd.it/1kvdc66
@r_devops