Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
AWS Spot Instance selection tool - looking for automation ideas

Sharing spotinfo - a CLI that simplifies spot instance selection for automation workflows.

**What it provides**:

* Query spot prices and interruption rates
* Single Go binary, no dependencies
* Works offline (embedded data)
* JSON/CSV output for scripting
* AI assistant integration via MCP

**Current automation patterns**:

1. **Dynamic selection**:

```bash
INSTANCE=$(spotinfo --cpu=4 --memory=16 --sort=price --output=text | head -1)
terraform apply -var="instance_type=$INSTANCE"
```

2. **Region optimization**:
```bash
spotinfo --type="m5.large" --region=all --output=csv | \
awk -F',' '$5 < 10 {print $1, $6}' | sort -k2 -n
```

3. **Fleet configuration**:
```bash
spotinfo --region=us-east-1 --output=json | \
jq '[.[] | select(.Range.max < 20)]' > spot-fleet.json
```

Also works with Claude Desktop/Cursor for team members who prefer natural language queries.

GitHub: [https://github.com/alexei-led/spotinfo](https://github.com/alexei-led/spotinfo)
(Stars help me understand usage patterns)

What spot instance automation patterns are you using? Which features would make your workflows smoother?

https://redd.it/1lou2pe
@r_devops
Tried doing ASPM in-house. Gave up after 3 sprints

We’re a mid-size SaaS shop running IaC + containers + CI/CD on GitHub Actions. Thought we could build a lightweight ASPM framework with OSS + some repo scanning.

Reality: maintaining policy-as-code at scale + tracking exposures across services + correlating to runtime risk was hell. Half the alerts were noisy, the rest got buried in Jira.

We’re now testing out a commercial CNAPP with ASPM baked in. Wondering if others went this route or made internal ASPM stick?

https://redd.it/1louxim
@r_devops
Simulating Real Users in Performance Testing

Most performance tests fail to reflect reality, and that’s why their results are misleading. Performance testing is supposed to tell us how a system holds up under real-world usage, but what often ends up happening is that we test a simplified model that does not reflect how users actually behave.

Take user behavior, for example. Real users don’t all behave the same way. A school app might be used mostly by students, followed by teachers, and only occasionally by admins or IT. If your load test simulates a uniform set of actions across evenly distributed users, you're not testing reality; you’re testing a fantasy.

Transaction behavior matters too: not every function in an app gets equal use. Logging in, assigning homework, and checking grades are daily-use functions. Others, like applying for a school trip or editing immunization records, happen rarely. Those rare actions don’t need to be in your main simulation, because they’re not what’s going to crash your system on Monday morning.
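A minimal sketch of that idea in Python, assuming a made-up traffic mix for the school app. The persona names, action names, and all percentages are illustrative assumptions, not measured data. It picks each virtual user's persona and action in proportion to a weight instead of uniformly:

```python
import random
from collections import Counter

# Hypothetical traffic mix for the school app; the numbers are assumptions.
PERSONA_WEIGHTS = {"student": 70, "teacher": 25, "admin": 5}

# Hypothetical per-persona action mix: daily-use functions dominate.
ACTION_WEIGHTS = {
    "student": {"login": 40, "check_grades": 45, "submit_homework": 15},
    "teacher": {"login": 30, "assign_homework": 40, "enter_grades": 30},
    "admin": {"login": 50, "manage_users": 50},
}


def pick_weighted(weights, rng):
    """Pick one key from a {name: weight} mapping, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate_sessions(count, seed=0):
    """Return a Counter of (persona, action) pairs for `count` virtual users."""
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    sessions = Counter()
    for _ in range(count):
        persona = pick_weighted(PERSONA_WEIGHTS, rng)
        action = pick_weighted(ACTION_WEIGHTS[persona], rng)
        sessions[(persona, action)] += 1
    return sessions


if __name__ == "__main__":
    for (persona, action), n in simulate_sessions(1000).most_common():
        print(f"{persona:8} {action:16} {n}")
```

In a real load tool the weights would come from production analytics rather than guesses, and each (persona, action) pair would map to a scripted scenario.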

Browser behavior is also often overlooked. Real browsers do a lot of optimization behind the scenes (loading resources in parallel, caching static files, managing cookies). If your testing tool isn’t mimicking these patterns, your tests are essentially stress tests, not performance simulations. Same thing with think time: humans pause! We read things, we hesitate before clicking, we take time to fill out forms. When your test scripts fire requests back-to-back with no delay, you're artificially inflating the load!
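To put a rough number on the think-time point, here is a small Python sketch. It assumes a closed loop where each virtual user sends a request, waits for the response, pauses, and repeats; the user count and timings in the demo are made-up values:

```python
def request_rate(virtual_users, response_time_s, think_time_s=0.0):
    """Steady-state requests per second for a closed loop of virtual users.

    Each user repeats: send a request, wait response_time_s for the reply,
    then pause think_time_s before the next request.
    """
    cycle = response_time_s + think_time_s
    if cycle <= 0:
        raise ValueError("response time plus think time must be positive")
    return virtual_users / cycle


if __name__ == "__main__":
    # Hypothetical numbers: 200 users, 0.5 s responses.
    print(request_rate(200, 0.5))        # scripts firing back-to-back
    print(request_rate(200, 0.5, 9.5))   # same users with 9.5 s of think time
```

With these assumed numbers the back-to-back script generates 20 times the request rate of the same 200 users pausing between actions, which is why a no-delay script behaves more like a stress test.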

Lastly, I want to talk about the server environment. If your test is running against a staging setup that’s less powerful than production, or configured differently, then your results can even be dangerous. You might either falsely panic or, worse, falsely relax.

TLDR: Performance testing only matters if it’s realistic. If you want actionable results, simulate actual user behavior with all its quirks (delays, caches, traffic patterns, and contextual priorities). Otherwise, you’re just collecting numbers that don’t reflect what users will experience.

What kinds of mistakes have you seen teams make that made performance tests useless? Or any stories where something passed in test but fell apart in prod?

https://redd.it/1lovn46
@r_devops
Dev/CloudOps Contracts

Hi, I have some free time together with a colleague, and we would like to take on some short-term or long-term contracts or projects in the DevOps/CloudOps area. Where is the best place to look for such opportunities?

https://redd.it/1lowl35
@r_devops
Announcing the Open Source Terraform Provider for OpenAI

I have an exciting announcement to make - we've just open-sourced the Terraform Provider for OpenAI. It covers most, if not all, resources that can be managed via an API: you can now provision your projects and service accounts as code, manage user access as code, and do some fun GenAI automations as code. Check out the full announcement, including a demo of generating new Internet-available AWS Lambda functions, with the code generated via the OAI provider and then passed to the Lambda deployment :)

https://mkdev.me/posts/announcing-the-open-source-terraform-provider-for-openai

https://redd.it/1loxtjm
@r_devops
K8s Argocd deployment changes script

I am on a new K8s project. I don't have a huge amount of experience with it, but I'm learning quickly.

We are deploying our helm charts/manifests using Argocd.

I have a task/requirement that is as follows:

When the Argo CD pipeline runs, identify the pods/apps that have changed, then output the changelog for each change to the terminal, so we can see what changed each time if we need to check old deployments.


My plan is to do this via a python script in the pipeline:

>1. check the current deploy values file (nonprod / preprod / prod).

>2. get versions of all pods.

>3. compare with previous versions (where to get this? check the last merge?)

>4. if the version changed

>5. query the Gitlab API and get the last merge title or something like that.

>6. echo to the terminal?
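A rough Python sketch of steps 3, 4 and 6, assuming the versions for the previous and current deployments have already been extracted into plain `{app: version}` dictionaries (for example from the values file at two Git commits). The function name and the sample app names are hypothetical; the GitLab lookup from step 5 is left out:

```python
def diff_versions(previous, current):
    """Compare two {app: version} mappings and describe what changed."""
    changes = []
    for app in sorted(set(previous) | set(current)):
        old, new = previous.get(app), current.get(app)
        if old == new:
            continue  # unchanged apps are not reported
        if old is None:
            changes.append(f"{app}: added at {new}")
        elif new is None:
            changes.append(f"{app}: removed (was {old})")
        else:
            changes.append(f"{app}: {old} -> {new}")
    return changes


if __name__ == "__main__":
    # Hypothetical sample data standing in for two parsed values files.
    previous = {"api": "1.0", "web": "2.0"}
    current = {"api": "1.1", "web": "2.0", "worker": "0.3"}
    for line in diff_versions(previous, current):
        print(line)
```

Keeping the comparison a pure function makes it easy to test separately from whatever fetches the two sets of versions.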

Curious how other people would tackle something like this? I have been doing devops a few years but it's 99% been AWS Terraform so this is a different type of challenge for me.


https://redd.it/1loyni9
@r_devops
The tools your team picks don’t just manage work, they shape how you think about work

One thing I’ve learned leading engineering teams: the tooling you choose quietly rewires how people prioritize, communicate and think about problems.

If your system only shows tasks, people think in tasks. If it pushes sprints, they optimize for burn-down. If it buries dependencies or hides capacity, you start planning in a vacuum and wonder why things fall apart mid-sprint.

We ran into this a while back. Engineers were doing solid work but things kept getting blocked or misaligned. It wasn’t a people problem, it was that our tooling wasn’t showing us how the work moved, just what the work was.

We ended up switching tools to something more visual – a board where you could actually see relationships, blocked work and workload across the team. Not saying tooling solves everything but seeing the system clearly helped the team make better technical decisions.

I’m curious, has anyone here had a tooling change that actually impacted the way your team thinks or works? Or do most tools just end up being wrappers around the same chaos?

https://redd.it/1lp09ce
@r_devops
How to reset Linux on cloud

Sorry if this is too basic a question. I currently flush things manually:


```bash
sudo deluser --remove-home unwanted_user
sudo apt-get update
sudo apt-get upgrade -y
sudo apt-get autoremove --purge -y
sudo rm -rf /etc/custom_config /var/log/*
```


But somehow I think there should be a better way!

Assume deleting VM/Machine and re-creating is not an option.

https://redd.it/1lp65h9
@r_devops
Incident Fest '25

Hi all,

I'm involved in a virtual festival that John Allspaw, Beth Long and Uptime Labs are running for DevOps/SREs (Incident Fest '25). It's a space where people can watch top incident responders react to challenging incidents, either live or on demand.

If this would be of interest to anyone, here's more info/signup: https://uptimelabs.io/virtual-festival-2025/

https://redd.it/1lp77zs
@r_devops
What is the actual advantage of using IaC tools for provisioning resources instead of Ansible?

For context, I am a software engineer falling in love with devops, SRE and servers

I manage my homelab cluster using mostly Ansible. It currently:

* Creates my Proxmox virtual machines
* Manages disk passthrough to them
* Installs Kubernetes and Calico
* Updates my UDM DNS and BGP routing
* Creates LVM partitions to be consumed by [OpenEBS](https://openebs.io/) later on
* etc.

So as you can see, almost everything is managed by ansible.

In my studies/experiments with other tools, I've settled on Pulumi (CDKTF doesn't seem very well supported) because it gives me more flexibility with Python. I use it for deploying my "homelab kubernetes platform" to the aforementioned Kubernetes cluster.

But why is using Ansible for provisioning resources/charts/etc. considered clunky?
I've seen other posts that suggest using Ansible for configuration and other tools for provisioning/creating resources. But managing both tools feels like a major hassle and adds other problems, like:

* Which tool is the authority here?
* Does Ansible invoke Pulumi, or the other way around?
* The source of truth becomes distributed over different places
* Defining the desired state ends up being decentralized, because I must add separate configs for Ansible and Pulumi
* I could define a "shared YAML" and read from that, but then I'd be taking on the responsibility of handling that myself instead of using a solution provided by a tool
* It feels like a bit of a hack, etc.

The best explanation I've found for this was this post that made some good points, but I'd like to hear other opinions

https://redd.it/1lp5x38
@r_devops
Well, I did it: made it to Product Hunt

I know it’s not a very cool tool, but after working in the industry for about 10 years I started wondering why not build a bridge between human intent and DevOps execution, so I started building an OSS tool.

https://www.producthunt.com/posts/ops0

Do you think operations are too much to handle or just repetitive all the time?

https://redd.it/1lp8f9g
@r_devops
How to safely change StorageClass reclaimPolicy from Delete to Retain without losing existing PVC data?

Hi everyone, I have a StorageClass in my Kubernetes cluster that uses reclaimPolicy: Delete by default. I’d like to change it to Retain to avoid losing persistent volume data when PVCs are deleted.

However, I want to make sure I don’t lose any existing data in the PVCs that are already using this StorageClass.
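One thing worth knowing here: each already-provisioned PersistentVolume carries its own `spec.persistentVolumeReclaimPolicy`, copied from the StorageClass when the volume was created, so the existing PVs need to be patched individually. Below is a hedged Python sketch that only builds the `kubectl patch pv` commands and does not run them. It assumes the input is the parsed output of `kubectl get pv -o json`; the function names and the `standard` class name are hypothetical:

```python
import json

# Patch body that switches one PersistentVolume to the Retain policy.
RETAIN_PATCH = {"spec": {"persistentVolumeReclaimPolicy": "Retain"}}


def volumes_to_patch(pv_list, storage_class):
    """Names of PVs from `storage_class` that still use the Delete policy.

    `pv_list` is the parsed JSON from `kubectl get pv -o json`.
    """
    names = []
    for pv in pv_list.get("items", []):
        spec = pv.get("spec", {})
        if (spec.get("storageClassName") == storage_class
                and spec.get("persistentVolumeReclaimPolicy") == "Delete"):
            names.append(pv["metadata"]["name"])
    return names


def patch_commands(names):
    """Build one `kubectl patch pv` argument list per volume (not executed)."""
    payload = json.dumps(RETAIN_PATCH)
    return [["kubectl", "patch", "pv", name, "-p", payload] for name in names]


if __name__ == "__main__":
    # Hypothetical sample standing in for real cluster output.
    sample = {"items": [{
        "metadata": {"name": "pv-a"},
        "spec": {"storageClassName": "standard",
                 "persistentVolumeReclaimPolicy": "Delete"},
    }]}
    for cmd in patch_commands(volumes_to_patch(sample, "standard")):
        print(" ".join(cmd))
```

Printing the commands first gives a chance to review the list of volumes before anything touches the cluster.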

https://redd.it/1lpb2qw
@r_devops
Going from NestJS backend work to Devops. Help.

For those that have a NestJS background would love to hear how you got into Devops.

*Deep DevOps, everything from hardened infrastructure to incident protocol, the whole gamut.

https://redd.it/1lpamk8
@r_devops
Learning Platform - Is KodeKloud worth it?

Hello, everyone.

I've been working with Kubernetes for a couple of months and have been learning everything as needed, but I feel I should adopt a more structured learning approach.

I have a learning budget available and have read that KodeKloud is a good option with reasonable pricing at $180 per year.

While I'm not particularly focused on certifications, I believe that certification preparation courses provide a solid framework for learning the necessary skills.

I'm considering enrolling in the CKA, CKAD, and CKS courses, then progressing to Istio and Cilium, as I need to develop more experience with service mesh and network policies.

Are there any good alternatives to KodeKloud that you would recommend?

https://redd.it/1lpgcpd
@r_devops
Startup versus established company

So, I’m working for a startup for the first time, after working for well established companies.

I’m finding the startup actually more fun because, instead of coming in and running into years of tech debt and glacial resistance to change, I get to just suggest doing something and am told to go ahead.

I’m actually being asked what I think is the best way to build or implement something. There are no “legacy” systems barely limping along with no one having the bandwidth to even think about migrating them.

Sure, there are cons to this. Sometimes there is a lack of well-thought-out access and security policies, and less of a sense of stability. There is a little too much to do and not enough people to do it.

I’ve also heard horror stories of working for startups.

Am I just like in the NRE phase of this?

What are y'all's thoughts on the difference?

https://redd.it/1lpgsrr
@r_devops
I made a simple API to scan web ports – curious what you think

Hey! 👋
I’ve been working on a small project and finally published it on RapidAPI — it’s called WebPortSpy.

Basically, it’s an API I built myself that lets you scan open ports on a domain. The idea started as a personal tool for quick recon during audits, and I figured it might be useful to others too. There’s also an optional paid tier if you want extra stuff like identifying vulnerable ports or even suggested exploits — but the basic functionality is free to use.

I’m still improving it, so any feedback from this community would be super appreciated. If you’ve got a minute, I’d love if you could test it out or just let me know what you think.

Here’s the link:
👉 https://rapidapi.com/infosecarg-infosecarg-default/api/webportspy

Cheers!

https://redd.it/1lpiryp
@r_devops
Stuck between AWS and Azure — need your advice!

I’m about to dive into Cloud Computing, but I’m currently torn between starting with AWS or Azure.

I’ve heard the differences between them aren’t that big in terms of core concepts, and that Azure might be easier for beginners, especially with its user-friendly interface and Microsoft integration.

But I’m also thinking about the bigger picture:
• Which one has better career opportunities overall?
• Which one provides more flexibility and long-term growth?
• And is it true that once you learn one, switching to the other is relatively smooth?

Would love to hear your thoughts and experiences! Any advice or perspective is welcome 🙌

#CloudComputing #AWS #Azure #CareerGrowth #ITCareers #TechLearning

https://redd.it/1lpjif2
@r_devops
How to get a job as a DevOps engineer

Sysadmin here. I love Linux and want to switch to DevOps engineering, learning on my own. How difficult will it be to get a DevOps job? Do I need certifications and all that?

https://redd.it/1lpoech
@r_devops
Can I change my career to back-end even if I start in DevOps?

I've been offered a DevOps job.

I was delighted because I kept failing job interviews for back-end developer roles.
But I'm still skeptical because I don't know exactly what DevOps does.

https://redd.it/1lpq0zn
@r_devops
How do you keep track of all the changes in your deployments for audit or compliance checks?

With how fast deployments happen these days, especially in more agile or automated environments, keeping a clear, auditable trail of every single change feels like a constant battle. It's not just about knowing what changed, but who changed it, when, and why, especially when multiple teams are pushing updates continuously. That level of detail is crucial for security and compliance, but it often feels like you're trying to capture water.

The challenge really hits during an audit when you need to quickly pull up specific records or prove adherence to a standard, and the information is scattered across different tools, logs, or even mental notes. How do you manage to maintain a robust, easily auditable history of all your deployment changes without slowing down your release cycles? Thanks for any insights!
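As one illustration of capturing the what, who, when and why in a single place, here is a small Python sketch that builds one JSON line per deployment and refuses to emit it if any audit field is missing. The field names and the idea of appending these lines to a central log are assumptions for the example, not a reference to any particular tool:

```python
import json
from datetime import datetime, timezone

# Hypothetical minimum set of fields an auditor might ask for.
REQUIRED_FIELDS = ("service", "version", "actor", "reason", "change_ref")


def deployment_record(**fields):
    """Return one JSON line describing a deployment, or raise if incomplete."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing audit fields: {', '.join(missing)}")
    record = {name: fields[name] for name in REQUIRED_FIELDS}
    # Timestamp is taken at record creation, in UTC, in ISO 8601 form.
    record["deployed_at"] = datetime.now(timezone.utc).isoformat()
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(deployment_record(
        service="api", version="1.4.2", actor="alice",
        reason="fix login timeout", change_ref="MR-123",
    ))
```

Calling something like this from the pipeline itself, rather than relying on people to write notes afterwards, is what keeps the record from slowing releases down.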

https://redd.it/1lprkx4
@r_devops