Reddit DevOps
I'm getting an error after certificate renewal please help

Hello,
My Kubernetes cluster was running smoothly until I tried to renew the certificates after they expired. I ran the following commands:

>sudo kubeadm certs renew all

>echo 'export KUBECONFIG=/etc/kubernetes/admin.conf' >> ~/.bashrc

>source ~/.bashrc


After that, some abnormalities started to appear in my cluster. Calico is completely down and even after deleting and reinstalling it, it does not come back up at all.

When I check the daemonsets and deployments in the kube-system namespace, I see:

>kubectl get daemonset -n kube-system

>NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE

>calico-node 0 0 0 0 0 kubernetes.io/os=linux 4m4s

>

>kubectl get deployments -n kube-system

>NAME READY UP-TO-DATE AVAILABLE AGE

>calico-kube-controllers 0/1 0 0 4m19s


Before this, I was also getting "unauthorized" errors in the kubelet logs, which started after renewing the certificates. This is definitely abnormal because the pods created from deployments are not coming up and remain stuck.

There is no error message shown during deployment either. Please help.

https://redd.it/1lo52hc
@r_devops
Update: DockedUp v1.0.0 release, check out the demo!

Hey r/devops!

Last week I introduced **DockedUp** — a real-time, interactive terminal dashboard for managing Docker containers. Thanks so much for the support and feedback! 🙌

I’ve just pushed a big update with performance improvements, better logs, and smoother UI — plus a new demo to show it off:

**Check out the new demo GIF**

### Install via pip or pipx:
    pipx install dockedup
### or
    pip install dockedup
### Then just run:
    dockedup

#### Links:

GitHub: [github.com/anilrajrimal1/dockedup](https://github.com/anilrajrimal1/dockedup)
PyPI: pypi.org/project/dockedup

https://redd.it/1loa4j8
@r_devops
Suggestions for an innovation sprint project? What useful new concepts or tech is 'trending'?

We are planning an innovation sprint (1 week to create a demo/PoC for a green-field project, 1 week to finalise, prep slides, and demonstrate) and are at the ideas stage. I had firm plans for how I wanted to use the time, which were completely derailed by a late directive to qualify for R&D tax credits.

I'm now in a position where I am absolutely uninterested and would like some help taking back some control of this valuable time - and not get roped in as a 6th person working on a 'support hub chat bot' project.

Any suggestions for things to consider?
\- Is there somewhere I can follow for good coverage of new trends and evolution in the DevOps field?
\- We have AKS clusters in Azure for deployments, without any tools like Kubecost implemented. Could this be a good way to brush up on my k8s/Helm knowledge and deliver something that would look good in my annual review if it yields any cost savings?

Thanks for any advice!

https://redd.it/1loa5w5
@r_devops
Is it possible to route non-HTTP traffic by DNS with Istio?

My assumption is no, but maybe there’s something that would work

Let’s say I have a JDBC connection for 3 databases db1.com, db2.com, db3.com

In K8s with Istio virtual services/gateways (without multiple load balancers), is it possible for all 3 connections to listen on TCP 5432 and then route to a DB in a specific namespace?

Example, assume the LB in the 3 is the exact same

User (db1) —> LB(5432) —> namespace 1

User (db2) —> LB(5432) —> namespace 2

User (db3) —> LB(5432) —> namespace 3

My assumption is that since this isn’t HTTP, we’re looking at L4, meaning the DNS hostname would be unknown to us/not usable.

Is this correct? Is there any way to do the above for a DB TCP connection with a single LB/port but route to namespaces based on the DNS name?
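For what it's worth, Istio can do this at L4 only when the client connection is TLS, because the SNI server name is the one hostname-like thing visible in a raw TCP stream. A sketch with a TLS-passthrough Gateway plus an SNI-matched VirtualService; the hostnames, namespace, and service names here are hypothetical, and you'd need one VirtualService per database:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: db-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 5432
        name: tls-db
        protocol: TLS
      tls:
        mode: PASSTHROUGH   # don't terminate TLS, just peek at SNI
      hosts:
        - db1.com
        - db2.com
        - db3.com
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: db1-route
spec:
  hosts:
    - db1.com
  gateways:
    - db-gateway
  tls:
    - match:
        - port: 5432
          sniHosts:
            - db1.com
      route:
        - destination:
            host: postgres.namespace1.svc.cluster.local
            port:
              number: 5432
```

Two caveats: plain TCP carries no hostname at all, so without TLS a single LB/port can only be split by port or by separate LB IPs; and Postgres clients normally negotiate TLS in-protocol (a plaintext `SSLRequest` followed by an upgrade), so the gateway never sees SNI unless the client opens with direct TLS (newer PostgreSQL clients support `sslnegotiation=direct`). Worth verifying against your driver before committing to this design.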

https://redd.it/1lodag9
@r_devops
Good observability tooling doesn’t mean teams actually understand it

Been an engineering manager at a large org for close to three years now. We’re not exactly a “digitally native” company, but we have ~5K developers. Platform org has solid observability tooling (LGTM stack, decent golden paths).

What I keep seeing though - both in my team and across the org - is that product engineers rarely understand the nuances of the “three pillars” of observability - logs, metrics, and traces.

Not because they’re careless, but because their cognitive budget is limited. They're focused on delivering product value, and learning three completely different mental models for telemetry is a real cost.

Even with good platform support, that knowledge gap has real implications -

* Slower incident response and triage
* Platform teams needing to educate and support a lot more
* Alert fatigue and poor signal-to-noise ratios

I wrote up [some thoughts](https://open.substack.com/pub/musingsonsoftware/p/org-implications-of-contemporary?r=57p3s&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true) on why these three pillars exist (hint - it’s storage and query constraints) and what that means for teams trying to build observability maturity -

* Metrics, logs, and traces are separate because they store and query data differently.
* That separation forces dev teams to learn three mental models.
* Even with “golden path” tooling, you can’t fully outsource that cognitive load.
* We should be thinking about unified developer experience, not just unified tooling.

Curious if others here have seen the same gap between tooling maturity and team understanding and if you do I'm eager to understand how you address it in your orgs.

https://redd.it/1loes4q
@r_devops
VENT Seeing engineers use LLMs to generate all the code that I used to write for them is concerning

One of our engineering directors decided to spin up a new service. Within minutes, he was able to produce the scripts / terraform to bring up the infra for these services, along with the scripts to deploy them. It’s very clear that this code was written by an LLM

It’s good, clean code too.

This is all stuff that I used to do, and I am realizing that pretty soon I will no longer be needed for this set of tasks.

This leads me to wonder what types of tasks I should focus on so as not to get automated away entirely.

I'm not trying to be a luddite or an alarmist. It's great that these tools have enabled higher productivity, and honestly writing those types of scripts was never particularly fun or engaging. Just trying to stay ahead of getting eaten by the AI bear.

https://redd.it/1lodvtm
@r_devops
Monday Questions - r/DevOptimize

r/DevOptimize is taking questions on making delivery simpler and packaging. Feel free to ask here or there.

* Are your deploys more steps than "install packages; per-env config; start services"? More than 100 lines?
* Do you have separate IaC source repos or branches for each environment? Let's discuss!
* Do you have more than two or three layers in your container build?

https://redd.it/1loi1wt
@r_devops
Are we supposed to know everything?

I used to think DevOps interviews would focus on CI/CD, observability, and maybe some k8s troubleshooting.
Then came a “design a distributed key-value store” question. My brain just… rebooted.

It’s not that I didn’t know what quorum or replication meant. But I hadn’t reviewed consensus protocols since college. I fumbled the difference between consistency and availability under pressure.

That interview was a wake-up call: if you're applying to DevOps roles that lean heavy on the “dev,” you will be asked to reason through failure models, caching layers, GC behavior, or how your system handles 4x traffic spikes without falling over.

Since then, I’ve been treating system design prep like a separate skill. I watch ByteByteGo on 1.5x speed. I sketch distributed tracing pipelines in Notion. I’ve also been using Beyz coding assistant to walk through mock scenarios. The kind where you have to balance tradeoffs and justify design choices on the fly.

It’s not about memorizing Raft vs Paxos. It’s about showing that you can ask good questions, make sane decisions, and evolve your design when requirements shift. (Also, knowing when not to build a whole new infra stack just to sound smart.)

System design interviews aren't going away. But neither is your ability to improve. Anyone else trying to "relearn" distributed systems after years of just... shipping YAML?

https://redd.it/1loj6m2
@r_devops
I need a UDP load balancer that can retry on timeouts

Greetings, friends,

Recently, I've been frantically searching for a solution to my problem:

I have a system that is composed of multiple servers that receive UDP packets and send back responses.

I need a load balancer that can also retry sending the UDP packet if no response comes back to it within 3 milliseconds. I need to check for ANY response, no parsing or anything.

I know that UDP doesn’t guarantee a response; unfortunately, that is exactly what I need, because otherwise I hit edge cases where I no longer have 100% availability.

So far, I'm using Envoy Proxy, however, it does not support such a functionality for UDP.

I looked into potentially extending Envoy proxy, to create a custom UDP filter with these retries, however, it seems to be a pretty daunting task.

I couldn't even compile Envoy to begin with. It took 4 hours and ended in an error.

Does anyone know of any solution that could help achieve this? A LOT of traffic needs to be handled.



https://redd.it/1loix24
@r_devops
Snyk free plan limits

Hi there,

I'm currently using Snyk on a private GitHub repository integrated with my GitHub Actions pipeline. Although I've exceeded the usage limits of the free plan by quite a bit, everything still seems to be working without issue.

Does anyone know why that might be the case? Should I expect the scans to stop working suddenly, or is there typically some buffer or grace period before enforcement?

Thanks in advance!

https://redd.it/1lohq5q
@r_devops
Deploying OpenStack on Azure VMs — Common Practice or Overkill?

Hey everyone,

I recently started my internship as a junior cloud architect, and I’ve been assigned a pretty interesting (and slightly overwhelming) task:
Set up a private cloud using OpenStack, but hosted entirely on Azure virtual machines.

Before I dive in too deep, I wanted to ask the community a few important questions:

1. Is this a common or realistic approach?
Using OpenStack on public cloud infrastructure like Azure feels a bit counterintuitive to me. Have you seen this done in production, or is it mainly used for learning/labs?


2. Does it help reduce costs, or can it end up being more expensive than using Azure-native services or even on-premise servers?


3. How complex is this setup in terms of architecture, networking, maintenance, and troubleshooting?
Any specific challenges I should be prepared for?


4. What are the best practices when deploying OpenStack in a public cloud environment like Azure? (e.g., VM sizing, network setup, high availability, storage options…)


5. Is OpenStack-Ansible a good fit for this scenario, or should I consider other deployment tools like Kolla-Ansible or DevStack?


6. Are there security implications I should be especially careful about when layering OpenStack over Azure?


7. If anyone has tried this before — what lessons did you learn the hard way?



If you’ve got any recommendations, links, or even personal experiences, I’d really appreciate it. I'm here to learn and avoid as many beginner mistakes as possible 😅

Thanks a lot in advance!

https://redd.it/1lol38q
@r_devops
GitHub Action failing - Cannot read password despite clearly seeing it as GITHUB_TOKEN

Hey guys,


Technical question here:

I am having an issue where my GITHUB_TOKEN is clearly set. [Tested by adding `echo "${#GITHUB_TOKEN}"`; the `#` outputs the token's length, obviously not the actual token.]

Yet I am getting `err: fatal: could not read Password for 'https://***@github.com':` in my GitHub Actions logs when trying to run git pull:

    git pull https://${GITHUB_TOKEN}@github.com/x/x.git main

Banging my head against this for the past three hours. Below is how I grab the GITHUB_TOKEN.



    on:
      push:
        branches: [ main ]
    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
          - name: Deploy to server
            uses: appleboy/[email protected]
            env:
              GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
            with:
              host: ${{ secrets.HOST }}
              username: ${{ secrets.USERNAME }}
              key: ${{ secrets.SSH_PRIVATE_KEY }}
              port: ${{ secrets.PORT || 22 }}
              envs: GITHUB_TOKEN
              script: |

Thank you!


Mike
https://redd.it/1loje9q
@r_devops
How does your company define DevOps, SRE, and Platform Teams?

For context: I’ve been a software engineer for 20 years and got into DevOps over a decade ago. I’ve held a variety of roles since then, and one thing I’ve noticed is that every company seems to structure the “ops” side of the house differently. I’m curious how other companies approach it.

At my current company, here’s how things are set up:

* **DevOps Team**: Owns cloud infrastructure, manages our CDK setup and CI/CD pipelines, and has a grab bag of other responsibilities.
* **SRE Team**: Functions more like a traditional NOC, handling day-to-day server support and managing observability. There's some overlap with the DevOps team, and the boundaries aren't always clear.
* **Platform Team**: Software engineers focused on building internal tools to support development and QA.

I’m still relatively new here, and the structure feels a bit unusual, especially compared to the model laid out in Google’s SRE book. I’d love to hear how other companies are organizing things.

https://redd.it/1lomymp
@r_devops
Another team took my work to corporate leadership and now they're "leading" a global rollout while I'm cast to the shadows. I had zero knowledge of this until they failed to reverse-engineer and contacted me.

Let me start by saying I’m (early career) a year into this corporate job at a "billion-dollar" multinational company. I fully understand that any work I do while employed is legally the company's intellectual property. That said, this post is more about how I can take advantage of my contributions for my career rather than being brushed aside.

Long story short, I single-handedly modernized a legacy system used in my region: automated several processes and deployments, migrated infra to the cloud, introduced GitOps and proper CI/CD pipelines, and implemented monitoring dashboards with Prometheus+Grafana. This overhaul gained so much traction that a team from another region requested I build the same system for them, tailored to their needs.

Now here’s where things got interesting. Apparently, while in conversations with this other region, someone higher up at the global level got access to my project and showed it to their boss who is just one level below the CEO. I still have no idea who this person is or how they even gained access to my work. Anyways, this corporate leader was so impressed that they decided the system should be rolled out globally as soon as possible. The person who shared my project then took it upon themselves to assign a team dedicated to replicating it for all regions.


Now this assigned team somehow managed to access my project (I genuinely suspect a security breach or admin-level involvement) and tried to reverse-engineer everything I built, but failed. They then began trying to identify who was behind the project and eventually contacted my manager (the "official" project manager) by pulling him into a meeting without prior notice. Odd.

So my manager then decided to set up a proper call with this team, with me involved this time. In this call, they basically came forward and asked us to provide all the code, tools, and cloud infrastructure so they can simply copy and paste it for all regions, as well as requesting several technical sessions. To make matters worse, they want me to handle all the IT bureaucratic processes for every region to get things set up (I can already see myself being roped into supporting all regions and not just my own at this point). However, I strongly believe this "replication" approach is destined to fail, as each region has different user requirements and processes not quite comparable to ours. And given their limited technical and business knowledge of the processes, and the type of technical questions I was being asked, I also strongly believe they will struggle to get anything running.


Anyways, if this team rolls out my solution globally for each region, they’ll receive all the visibility and credit (they’ll be hosting demo sessions with region leaders, which I for sure won’t be invited to), while I’ll be essentially cast into the shadows. What’s frustrating is that I have full knowledge of the system and am responsible for it, so why isn’t my manager at least the one leading this global rollout, instead of some random team?

I’ve been trying to indirectly nudge my manager to take ownership of the global rollout, instead of letting this new team take over. But I’m not sure how this will play out. The person who assigned this team is closer to the corporate leader, while my manager is a few steps lower in the hierarchy. So far, all he’s done is try to keep our regional manager informed of the situation playing out. Realistically, only the regional manager can mention this to the corporate leader, but I’m not confident that will happen.

My manager often says "how will this benefit the team?" But in this case, it’s clear he’s struggling to see any benefit in simply handing over our work to another team that will walk away with all the credit.

We’re still in the early stages, and I haven’t handed anything over yet. But I’m deeply
concerned about how this is unfolding. From a career perspective, it looks like I'm gaining nothing from this besides telling myself I did the work. Being so early in my career, a project like this would really benefit me tenfold. I really don't want to waste this chance to turn this into something beneficial.

https://redd.it/1lor008
@r_devops
Built an audiobook on AI infra (NVIDIA cert prep) – Free chapters out now

Hey,
If you’ve ever had to manage GPUs, troubleshoot inference endpoints, or optimize AI workloads, this might interest you:

🎧 I’m building an audiobook series based on the NVIDIA Certified AI Infrastructure & Operations (NCA-AIIO) certification.

The first 4 chapters are free and walk through:

* AI infra basics
* GPU architecture
* AI/ML frameworks
* Networking for AI inference and training

I created it for those who prefer learning on the go.
The full version will include real-world ops, deployment patterns, performance tuning, and security.

🔗 Free chapters here

Would love feedback from anyone working with production ML or AI systems!

https://redd.it/1losjiz
@r_devops
AWS Spot Instance selection tool - looking for automation ideas

Sharing spotinfo - a CLI that simplifies spot instance selection for automation workflows.

**What it provides**:

* Query spot prices and interruption rates
* Single Go binary, no dependencies
* Works offline (embedded data)
* JSON/CSV output for scripting
* AI assistant integration via MCP

**Current automation patterns**:

1. **Dynamic selection**:

```bash
INSTANCE=$(spotinfo --cpu=4 --memory=16 --sort=price --output=text | head -1)
terraform apply -var="instance_type=$INSTANCE"
```

2. **Region optimization**:
```bash
spotinfo --type="m5.large" --region=all --output=csv | \
awk -F',' '$5 < 10 {print $1, $6}' | sort -k2 -n
```

3. **Fleet configuration**:
```bash
spotinfo --region=us-east-1 --output=json | \
jq '[.[] | select(.Range.max < 20)]' > spot-fleet.json
```

Also works with Claude Desktop/Cursor for team members who prefer natural language queries.

GitHub: [https://github.com/alexei-led/spotinfo](https://github.com/alexei-led/spotinfo)
(Stars help me understand usage patterns)

What spot instance automation patterns are you using? Which features would make your workflows smoother?

https://redd.it/1lou2pe
@r_devops
Tried doing ASPM in-house. Gave up after 3 sprints

We’re a mid-size SaaS shop running IaC + containers + CI/CD on GitHub Actions. Thought we could build a lightweight ASPM framework with OSS + some repo scanning.

Reality: maintaining policy-as-code at scale + tracking exposures across services + correlating to runtime risk was hell. Half the alerts were noisy, the rest got buried in Jira.

We’re now testing out a commercial CNAPP with ASPM baked in. Wondering if others went this route or made internal ASPM stick?

https://redd.it/1louxim
@r_devops
Simulating Real Users in Performance Testing

Most performance tests fail to reflect reality, and that’s why their results are misleading. Performance testing is supposed to tell us how a system holds up under real-world usage, but what often ends up happening is that we test a simplified model that doesn’t reflect how users actually behave.

Take user behavior, for example. Real users don’t all behave the same way. A school app might be used mostly by students, followed by teachers, and only occasionally by admins or IT. If your load test simulates a uniform set of actions across evenly distributed users, you're not testing reality.. you’re testing a fantasy.

In terms of transaction behavior... not every function in an app gets equal use. Logging in, assigning homework, checking grades... those are daily-use functions. Others, like applying for a school trip or editing immunization records, happen rarely. Those rare actions don’t need to be in your main simulation; they’re not what’s going to crash your system on Monday morning.

Browser behavior is also often overlooked. Real browsers do a lot of optimization behind the scenes (loading resources in parallel, caching static files, managing cookies). If your testing tool isn’t mimicking these patterns, your tests are essentially stress tests, not performance simulations. Same thing with think time: humans pause! We read things, we hesitate before clicking, we take time to fill out forms. When your test scripts fire requests back-to-back with no delay, you're artificially inflating the load!

Lastly, I want to talk about the server environment. If your test runs against a staging setup that’s less powerful than production, or configured differently, then your results can even be dangerous: you might falsely panic or, worse, falsely relax.

TLDR: Performance testing only matters if it’s realistic. If you want actionable results, simulate actual user behavior with all its quirks (delays, caches, traffic patterns, and contextual priorities). Otherwise, you’re just collecting numbers that don’t reflect what users will experience.

What kinds of mistakes have you seen teams make that made performance tests useless? Or any stories where something passed in test but fell apart in prod?

https://redd.it/1lovn46
@r_devops
Dev/CloudOps Contracts

Hi, I have some free time together with a colleague, and we would like to take on some short-term or long-term contracts or projects in the DevOps/CloudOps area. Where is the best place to look for such opportunities?

https://redd.it/1lowl35
@r_devops
Announcing the Open Source Terraform Provider for OpenAI

I have an exciting announcement to make: we've just open-sourced the Terraform Provider for OpenAI. It covers most, if not all, resources that can be managed via the API. You can now provision your projects and service accounts as code, manage user access as code, and do some fun GenAI automations as code. Check out the full announcement, including a demo of generating new Internet-available AWS Lambda functions, with the code generated via the OAI provider and then passed to the Lambda deployment :)

https://mkdev.me/posts/announcing-the-open-source-terraform-provider-for-openai

https://redd.it/1loxtjm
@r_devops