Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Snyk free plan limits

Hi there,

I'm currently using Snyk on a private GitHub repository integrated with my GitHub Actions pipeline. Although I've exceeded the usage limits of the free plan by quite a bit, everything still seems to be working without issue.

Does anyone know why that might be the case? Should I expect the scans to stop working suddenly, or is there typically some buffer or grace period before enforcement?

Thanks in advance!

https://redd.it/1lohq5q
@r_devops
Deploying OpenStack on Azure VMs — Common Practice or Overkill?

Hey everyone,

I recently started my internship as a junior cloud architect, and I’ve been assigned a pretty interesting (and slightly overwhelming) task:
Set up a private cloud using OpenStack, but hosted entirely on Azure virtual machines.

Before I dive in too deep, I wanted to ask the community a few important questions:

1. Is this a common or realistic approach?
Using OpenStack on public cloud infrastructure like Azure feels a bit counterintuitive to me. Have you seen this done in production, or is it mainly used for learning/labs?


2. Does it help reduce costs, or can it end up being more expensive than using Azure-native services or even on-premise servers?


3. How complex is this setup in terms of architecture, networking, maintenance, and troubleshooting?
Any specific challenges I should be prepared for?


4. What are the best practices when deploying OpenStack in a public cloud environment like Azure? (e.g., VM sizing, network setup, high availability, storage options…)


5. Is OpenStack-Ansible a good fit for this scenario, or should I consider other deployment tools like Kolla-Ansible or DevStack?


6. Are there security implications I should be especially careful about when layering OpenStack over Azure?


7. If anyone has tried this before — what lessons did you learn the hard way?



If you’ve got any recommendations, links, or even personal experiences, I’d really appreciate it. I'm here to learn and avoid as many beginner mistakes as possible 😅

Thanks a lot in advance!

https://redd.it/1lol38q
@r_devops
GitHub action failing - Cannot read password despite clearly seeing it as GITHUB_TOKEN

Hey guys,


Technical question here:

I am hitting an error even though my GITHUB_TOKEN is being seen. [Tested by adding 'echo "${#GITHUB_TOKEN}"'; the pound symbol outputs the length, obviously not the actual token.]

Yet I am getting 'err: fatal: could not read Password for 'https://***@github.com': ' in my GitHub Actions logs when trying to run git pull.

git pull https://${GITHUB_TOKEN}@github.com/x/x.git main

I've been banging my head against this for the past three hours. Below is how I grab the GITHUB_TOKEN.



on:
  push:
    branches: [ main ]
jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Deploy to server
        uses: appleboy/[email protected]
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          host: ${{ secrets.HOST }}
          username: ${{ secrets.USERNAME }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          port: ${{ secrets.PORT || 22 }}
          envs: GITHUB_TOKEN
          script: |
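
One possible reading of the error, offered as an assumption rather than a diagnosis: with `https://${GITHUB_TOKEN}@github.com/...`, git treats the token as the username and still asks for a password, which it cannot read without a terminal. Below is a minimal Python sketch of building a remote URL that carries both parts, using `x-access-token` as the username (the form GitHub documents for installation tokens). The helper name `authenticated_remote` and its defaults are hypothetical.

```python
from urllib.parse import quote


def authenticated_remote(token: str, repo: str, user: str = "x-access-token") -> str:
    """Build an HTTPS remote with a username and the token as the password.

    `repo` is "owner/name". The helper name and defaults are illustrative only.
    """
    if not token:
        # An empty token would silently produce a URL git cannot authenticate with.
        raise ValueError("token is empty; check that it reaches the remote shell")
    # Percent-encode the token so characters such as '@' or '/' cannot break the URL.
    return f"https://{user}:{quote(token, safe='')}@github.com/{repo}.git"


if __name__ == "__main__":
    # Placeholder token; never print a real one.
    print(authenticated_remote("example-token", "x/x"))
```

The resulting string would take the place of the URL passed to `git pull` in the remote script.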

Thank you!


Mike

https://redd.it/1loje9q
@r_devops
How does your company define DevOps, SRE, and Platform Teams?

For context: I’ve been a software engineer for 20 years and got into DevOps over a decade ago. I’ve held a variety of roles since then, and one thing I’ve noticed is that every company seems to structure the “ops” side of the house differently. I’m curious how other companies approach it.

At my current company, here’s how things are set up:

* **DevOps Team**: Owns cloud infrastructure, manages our CDK setup and CI/CD pipelines, and has a grab bag of other responsibilities.
* **SRE Team**: Functions more like a traditional NOC, handling day-to-day server support and managing observability. There's some overlap with the DevOps team, and the boundaries aren't always clear.
* **Platform Team**: Software engineers focused on building internal tools to support development and QA.

I’m still relatively new here, and the structure feels a bit unusual, especially compared to the model laid out in Google’s SRE book. I’d love to hear how other companies are organizing things.

https://redd.it/1lomymp
@r_devops
Another team took my work to corporate leadership and now they're "leading" a global rollout while I'm cast to the shadows. I had zero knowledge of this until they failed to reverse-engineer and contacted me.

Let me start by saying I’m early in my career, a year into this corporate job at a "billion-dollar" multinational company. I fully understand that any work I do while employed is legally the company's intellectual property. That said, this post is more about how I can make my contributions count for my career rather than being brushed aside.

Long story short, I single-handedly modernized a legacy system used in my region, automated several processes and deployments, migrated infra to the cloud, introduced GitOps and proper CI/CD pipelines, and implemented monitoring dashboards with Prometheus+Grafana. This overhaul gained a lot of traction, so much so that a team from another region requested I build the same system for them, tailored to their needs.

Now here’s where things got interesting. Apparently, while in conversations with this other region, someone higher up at the global level got access to my project and showed it to their boss who is just one level below the CEO. I still have no idea who this person is or how they even gained access to my work. Anyways, this corporate leader was so impressed that they decided the system should be rolled out globally as soon as possible. The person who shared my project then took it upon themselves to assign a team dedicated to replicating it for all regions.


Now this assigned team somehow managed to access my project (I genuinely suspect a security breach or admin-level involvement) and tried to reverse-engineer everything I built, but failed. They then began trying to identify who was behind the project and eventually contacted my manager (the "official" project manager) by pulling him into a meeting without prior notice. Odd.

So my manager then decided to set up a proper call with this team, with me involved this time. In this call, they basically came forward and asked us to provide all the code, tools, and cloud infrastructure so they can simply copy and paste it for all regions, and they also requested several technical sessions. To make matters worse, they want me to handle all the IT bureaucratic processes for every region to get things set up (I can already see myself being roped into supporting all regions and not just my own at this point). However, I strongly believe this "replication" approach is destined to fail, as each region has different user requirements and processes not quite comparable to ours. Judging by the type of technical questions I was being asked, I also strongly believe they will struggle to get anything running, given their limited technical and business knowledge of the processes.


Anyways, if this team rolls out my solution globally for each region, they’ll receive all the visibility and credit (they'll be hosting demo sessions with region leaders, which for sure I won't be invited to), while I'll essentially be cast into the shadows. What’s frustrating is that I have full knowledge of the system and am responsible for it, so why isn't my manager at least the one leading this global rollout rather than some random team?

I’ve been trying to indirectly nudge my manager to take ownership of the global rollout, instead of letting this new team take over. But I’m not sure how this will play out. The person who assigned this team is closer to the corporate leader, while my manager is a few steps lower in the hierarchy. So far, all he’s done is try to keep our regional manager informed of the situation playing out. Realistically, only the regional manager can mention this to the corporate leader, but I’m not confident that will happen.

My manager often says "how will this benefit the team?" But in this case, it’s clear he’s struggling to see any benefit in simply handing over our work to another team that will walk away with all the credit.

We’re still in the early stages, and I haven’t handed anything over yet. But I’m deeply concerned about how this is unfolding. From a career perspective, it looks like I'm gaining nothing from this besides telling myself I did the work. Being so early in my career, a project like this would really benefit me tenfold. I really don't want to waste this chance to turn this into something beneficial.

https://redd.it/1lor008
@r_devops
Built an audiobook on AI infra (NVIDIA cert prep) – Free chapters out now

Hey,
If you’ve ever had to manage GPUs, troubleshoot inference endpoints, or optimize AI workloads, this might interest you:

🎧 I’m building an audiobook series based on the NVIDIA Certified AI Infrastructure & Operations (NCA-AIIO) certification.

The first 4 chapters are free and walk through:

* AI infra basics
* GPU architecture
* AI/ML frameworks
* Networking for AI inference and training

I created it for those who prefer learning on the go.
The full version will include real-world ops, deployment patterns, performance tuning, and security.

🔗 Free chapters here

Would love feedback from anyone working with production ML or AI systems!

https://redd.it/1losjiz
@r_devops
AWS Spot Instance selection tool - looking for automation ideas

Sharing spotinfo - a CLI that simplifies spot instance selection for automation workflows.

**What it provides**:

* Query spot prices and interruption rates
* Single Go binary, no dependencies
* Works offline (embedded data)
* JSON/CSV output for scripting
* AI assistant integration via MCP

**Current automation patterns**:

1. **Dynamic selection**:

```bash
INSTANCE=$(spotinfo --cpu=4 --memory=16 --sort=price --output=text | head -1)
terraform apply -var="instance_type=$INSTANCE"
```

2. **Region optimization**:
```bash
spotinfo --type="m5.large" --region=all --output=csv | \
awk -F',' '$5 < 10 {print $1, $6}' | sort -k2 -n
```

3. **Fleet configuration**:
```bash
spotinfo --region=us-east-1 --output=json | \
jq '[.[] | select(.Range.max < 20)]' > spot-fleet.json
```

Also works with Claude Desktop/Cursor for team members who prefer natural language queries.

GitHub: [https://github.com/alexei-led/spotinfo](https://github.com/alexei-led/spotinfo)
(Stars help me understand usage patterns)

What spot instance automation patterns are you using? Which features would make your workflows smoother?

https://redd.it/1lou2pe
@r_devops
Tried doing ASPM in-house. Gave up after 3 sprints

We’re a mid-size SaaS shop running IaC + containers + CI/CD on GitHub Actions. Thought we could build a lightweight ASPM framework with OSS + some repo scanning.

Reality: maintaining policy-as-code at scale + tracking exposures across services + correlating to runtime risk was hell. Half the alerts were noisy, the rest got buried in Jira.

We’re now testing out a commercial CNAPP with ASPM baked in. Wondering if others went this route or made internal ASPM stick?

https://redd.it/1louxim
@r_devops
Simulating Real Users in Performance Testing

Most performance tests fail to reflect reality, and that’s why their results are misleading. We know that performance testing is supposed to tell us how a system holds up under real-world usage, but what often ends up being tested is a simplified model that does not reflect how users actually behave.

Take user behavior, for example. Real users don’t all behave the same way. A school app might be used mostly by students, followed by teachers, and only occasionally by admins or IT. If your load test simulates a uniform set of actions across evenly distributed users, you're not testing reality; you’re testing a fantasy.

In terms of transaction behavior, not every function in an app gets equal use. Logging in, assigning homework, checking grades: those are daily-use functions. Others, like applying for a school trip or editing immunization records, happen rarely. Those rare actions don’t need to be in your main simulation; they’re not what’s going to crash your system on Monday morning.

Browser behavior is also often overlooked. Real browsers do a lot of optimization behind the scenes (loading resources in parallel, caching static files, managing cookies). If your testing tool isn’t mimicking these patterns, your tests are essentially stress tests, not performance simulations. Same thing with think time: humans pause! We read things, we hesitate before clicking, we take time to fill out forms. When your test scripts fire requests back-to-back with no delay, you're artificially inflating the load!

Lastly, I want to talk about the server environment. If your test is running against a staging setup that’s less powerful than production, or configured differently, then your results can even be dangerous. You might either falsely panic or, worse, falsely relax.

TLDR: Performance testing only matters if it’s realistic. If you want actionable results, simulate actual user behavior with all its quirks (delays, caches, traffic patterns, and contextual priorities). Otherwise, you’re just collecting numbers that don’t reflect what users will experience.

What kinds of mistakes have you seen teams make that made performance tests useless? Or any stories where something passed in test but fell apart in prod?
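
The user mix, transaction mix, and think time described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the role and action weights, the 2 to 10 second pause range, and all function names are made up for the school-app example, not taken from any real system or load-testing tool.

```python
import random
from collections import Counter

# Hypothetical traffic mix for the school-app example: mostly students.
ROLE_WEIGHTS = {"student": 80, "teacher": 15, "admin": 5}
# Hypothetical daily-use actions; rare ones are left out of the main run.
ACTION_WEIGHTS = {"login": 30, "check_grades": 50, "assign_homework": 20}


def pick_weighted(weights: dict, rng: random.Random) -> str:
    """Pick one key with probability proportional to its weight."""
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


def think_time(rng: random.Random, low: float = 2.0, high: float = 10.0) -> float:
    """Seconds a simulated user pauses before the next request."""
    return rng.uniform(low, high)


def build_session(rng: random.Random, steps: int = 5) -> dict:
    """Plan one virtual user's session: a role plus (action, pause) pairs."""
    return {
        "role": pick_weighted(ROLE_WEIGHTS, rng),
        "steps": [(pick_weighted(ACTION_WEIGHTS, rng), think_time(rng)) for _ in range(steps)],
    }


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run can be repeated
    sessions = [build_session(rng) for _ in range(1000)]
    print(Counter(s["role"] for s in sessions))
```

A load-testing tool would then replay each planned session, sleeping for the pause between requests instead of firing them back-to-back.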

https://redd.it/1lovn46
@r_devops
Dev/CloudOps Contracts

Hi, I have some free time together with a colleague, and we would like to take on some short-term or long-term contracts or projects in the DevOps/CloudOps area. Where is the best place to look for such opportunities?

https://redd.it/1lowl35
@r_devops
Announcing the Open Source Terraform Provider for OpenAI

I have an exciting announcement to make - we've just open sourced Terraform Provider for OpenAI. It covers most, if not all, resources that can be managed via an API - you can now provision your projects and service accounts as code, manage user access as code and do some fun GenAI automations as code. Check out the full announcement - including a demo of generating new Internet-available AWS Lambda Functions, with the code generated via the OAI provider and then passed to the Lambda deployment :)

https://mkdev.me/posts/announcing-the-open-source-terraform-provider-for-openai

https://redd.it/1loxtjm
@r_devops
K8s Argocd deployment changes script

I am on a new K8S project, don't have a huge amount of experience with it but learning quickly.

We are deploying our helm charts/manifests using Argocd.

I have a task/requirement that is as follows:

When the argocd pipeline runs, identify the pods/apps that have changed, then output the changelog for each change to the terminal so we can see what was changed each time if we need to check old deployments.


My plan is to do this via a python script in the pipeline:

>1. check the current deploy values file (nonprod / preprod / prod).

>2. get versions of all pods.

>3. compare with previous versions (where to get this? check the last merge?)

>4. if the version changed

>5. query the Gitlab API and get the last merge title or something like that.

>6. echo to the terminal?

Curious how other people would tackle something like this? I have been doing devops a few years but it's 99% been AWS Terraform so this is a different type of challenge for me.
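
Steps 1 to 4 of the plan above boil down to comparing two version maps. A minimal Python sketch, assuming the versions have already been read into dictionaries (for example from the current values file and from its previous Git revision, which is one possible answer to the "where to get this?" in step 3). The function names and sample versions are hypothetical, and the GitLab lookup in step 5 is not covered.

```python
from typing import Dict, Optional, Tuple

Change = Tuple[Optional[str], Optional[str]]


def changed_versions(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, Change]:
    """Return {app: (old_version, new_version)} for every app whose version differs.

    None marks an app that was added (old is None) or removed (new is None).
    """
    changes: Dict[str, Change] = {}
    for app in sorted(set(previous) | set(current)):
        old, new = previous.get(app), current.get(app)
        if old != new:
            changes[app] = (old, new)
    return changes


def format_changelog(changes: Dict[str, Change]) -> str:
    """Render the changes as plain lines suitable for echoing to the terminal."""
    if not changes:
        return "no version changes"
    return "\n".join(
        f"{app}: {old or '(new)'} -> {new or '(removed)'}" for app, (old, new) in changes.items()
    )


if __name__ == "__main__":
    # Hypothetical versions, as they might be read from two revisions of a values file.
    before = {"api": "1.4.0", "web": "2.1.0", "worker": "0.9.3"}
    after = {"api": "1.5.0", "web": "2.1.0", "billing": "0.1.0"}
    print(format_changelog(changed_versions(before, after)))
```

Keeping the comparison separate from the formatting makes it easier to attach a merge title per changed app later on.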


https://redd.it/1loyni9
@r_devops
The tools your team picks don’t just manage work, they shape how you think about work

One thing I’ve learned leading engineering teams: the tooling you choose quietly rewires how people prioritize, communicate and think about problems.

If your system only shows tasks, people think in tasks. If it pushes sprints, they optimize for burn-down. If it buries dependencies or hides capacity, you start planning in a vacuum and wonder why things fall apart mid-sprint.

We ran into this a while back. Engineers were doing solid work but things kept getting blocked or misaligned. It wasn’t a people problem, it was that our tooling wasn’t showing us how the work moved, just what the work was.

We ended up switching tools to something more visual – a board where you could actually see relationships, blocked work and workload across the team. Not saying tooling solves everything but seeing the system clearly helped the team make better technical decisions.

I’m curious, has anyone here had a tooling change that actually impacted the way your team thinks or works? Or do most tools just end up being wrappers around the same chaos?

https://redd.it/1lp09ce
@r_devops
How to reset Linux on cloud

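Sorry if it is too lame to ask this question. I actually have a way that I flush things manually:


# Remove the unwanted user together with its home directory
sudo deluser --remove-home unwanted_user
# Refresh package lists and apply pending upgrades
sudo apt-get update
sudo apt-get upgrade -y
# Drop packages that are no longer needed, including their config files
sudo apt-get autoremove --purge -y
# Delete custom configuration and all files under /var/log
sudo rm -rf /etc/custom_config /var/log/*


But somehow I think there should be a better way!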

Assume deleting VM/Machine and re-creating is not an option.

https://redd.it/1lp65h9
@r_devops
Incident Fest '25

Hi all,

I'm involved in a virtual festival that John Allspaw, Beth Long and Uptime Labs are running for DevOps/SREs (Incident Fest '25). It's a space where people can watch top incident responders react to challenging incidents, either live or on demand.

If this would be of interest to anyone, here's more info/signup: https://uptimelabs.io/virtual-festival-2025/

https://redd.it/1lp77zs
@r_devops
What is the actual advantage of using IaC tools for provisioning resources instead of Ansible?

For context, I am a software engineer falling in love with devops, SRE and servers

I manage my homelab cluster using mostly ansible. It currently:

* Creates my Proxmox virtual machines
* Manages disk passthrough to them
* Installs kubernetes and calico
* Updates my UDM DNS and BGP routing
* Creates LVM partitions to be consumed by [OpenEBS](https://openebs.io/) later on
* etc, etc, etc

So as you can see, almost everything is managed by ansible.

In my studies/experimentations with other tools, I've settled on Pulumi (CDKTF doesn't seem very well supported) because it gives me more flexibility with Python. I use it for deploying my "homelab kubernetes platform" to the aforementioned kubernetes cluster.

But like, why is using ansible for provisioning resources/charts/etc considered clunky?
I've seen other posts that suggest using ansible for configuration, and other tools for provisioning/creating resources. But managing both tools feels like a major hassle and adds some other problems like:

* Which tool is the authority here?
* Does ansible invoke pulumi, or the other way around?
* Source of truth becomes distributed over different places
* Defining what the desired state is ends up being decentralized, because I must add separate configs for ansible and pulumi
* I could define a "shared yaml" and read from that, but then I'd be taking on the responsibility of handling that myself instead of using a solution provided by a tool
* Feels like a bit of a hack, etc etc etc

The best explanation I've found for this was this post that made some good points, but I'd like to hear other opinions

https://redd.it/1lp5x38
@r_devops
Well I did it, made it to Product Hunt

I know it’s not a very cool tool, but working in the industry for about 10 years made me think: why not build a bridge between human intent and DevOps execution? So I started building an OSS tool.

https://www.producthunt.com/posts/ops0

Do you think operations are too much to handle or just repetitive all the time?

https://redd.it/1lp8f9g
@r_devops
How to safely change StorageClass reclaimPolicy from Delete to Retain without losing existing PVC data?

Hi everyone, I have a StorageClass in my Kubernetes cluster that uses reclaimPolicy: Delete by default. I’d like to change it to Retain to avoid losing persistent volume data when PVCs are deleted.

However, I want to make sure I don’t lose any existing data in the PVCs that are already using this StorageClass.
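
A minimal sketch, assuming the usual Kubernetes behavior that each existing PersistentVolume carries its own `persistentVolumeReclaimPolicy`, so protecting existing data means patching the PVs themselves rather than only the StorageClass. The function below only builds `kubectl patch pv` commands for review and runs nothing. The function name and the PV names are placeholders.

```python
import json
import shlex
from typing import List


def retain_patch_command(pv_name: str) -> List[str]:
    """Build (but do not run) a kubectl command that sets one PV to Retain."""
    patch = json.dumps({"spec": {"persistentVolumeReclaimPolicy": "Retain"}})
    return ["kubectl", "patch", "pv", pv_name, "-p", patch]


if __name__ == "__main__":
    # Placeholder PV names; in practice they would come from `kubectl get pv`.
    for name in ["pvc-1111", "pvc-2222"]:
        print(shlex.join(retain_patch_command(name)))
```

Printing the commands first leaves room to check each PV name before anything is changed in the cluster.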

https://redd.it/1lpb2qw
@r_devops
Going from NestJS backend work to Devops. Help.

For those who have a NestJS background, I would love to hear how you got into Devops.

*Deep Devops, everything from hardened infrastructure to incident protocol, the whole gamut.

https://redd.it/1lpamk8
@r_devops
Learning Platform - Is KodeKloud worth it?

Hello, everyone.

I've been working with Kubernetes for a couple of months and have been learning everything as needed, but I feel I should adopt a more structured learning approach.

I have a learning budget available and have read that KodeKloud is a good option with reasonable pricing at $180 per year.

While I'm not particularly focused on certifications, I believe that certification preparation courses provide a solid framework for learning the necessary skills.

I'm considering enrolling in the CKA, CKAD, and CKS courses, then progressing to Istio and Cilium, as I need to develop more experience with service mesh and network policies.

Are there any good alternatives to KodeKloud that you would recommend?

https://redd.it/1lpgcpd
@r_devops