Reddit DevOps
Reddit DevOps. #devops
Octopus Deploy Reviews... What's your feedback?

I'm curious about Octopus Deploy in practical DevOps settings... It seems to have great ratings, especially for integration and support. While it gets praise for customizable steps and its UI, I’ve seen mentions of permissions headaches. If you've used it, what do you think: love it or hate it? How does it handle complex scaling? Any quirks I should know about? And with all the options out there, is it still worth using in 2025? Looking forward to this community's takes. I've gotten a ton of value as a lurker. Thanks in advance...

https://redd.it/1lnu62b
@r_devops
Ansible vs Terraform for idempotency?

This post assumes all of us are familiar with these two tools for infrastructure provisioning and configuration. This has been bugging me for a while. The shop I’m at has a hybrid cloud setup, I’ve been using both of these tools, and I’m finding that Terraform is slowly becoming redundant. Both tools are sold on their idempotency for provisioning and configuration.

Terraform handles idempotency using state files kept in a persistent data store.

Ansible handles idempotency by “gathering facts” in memory to avoid any drift.

Pardon my ignorance, as this might have been asked from another angle in this sub. But why would I choose Terraform over Ansible for infrastructure provisioning at this point, with the hassle of handling persistent state files, when I can just do a dry run of Ansible to see the state of my infrastructure, all handled in memory?
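
To make the contrast concrete, here is a toy Python sketch of the two approaches described above. It is not how either tool is implemented; the function names and the resource dictionaries are hypothetical. It assumes one commonly cited difference: a recorded state lets a planner notice resources that were removed from the configuration, while a planner that only gathers live facts cannot tell which extra resources it owns.

```python
def diff(desired, known):
    """Return (create, update) lists for resources named in `desired`."""
    create = sorted(k for k in desired if k not in known)
    update = sorted(k for k in desired if k in known and desired[k] != known[k])
    return create, update


def plan_with_state(desired, recorded_state):
    """State-file style: compare the config against the last recorded state.

    Anything recorded but no longer declared is planned for deletion.
    """
    create, update = diff(desired, recorded_state)
    delete = sorted(k for k in recorded_state if k not in desired)
    return {"create": create, "update": update, "delete": delete}


def plan_with_live_facts(desired, gather_facts):
    """Fact-gathering style: query the live system on every run, keep nothing.

    Without a record of ownership, undeclared resources are left alone.
    """
    create, update = diff(desired, gather_facts())
    return {"create": create, "update": update, "delete": []}


# Hypothetical example data.
desired = {"web": {"size": "m"}, "db": {"size": "l"}}
recorded = {"web": {"size": "s"}, "cache": {"size": "s"}}
live = {"web": {"size": "s"}, "cache": {"size": "s"}, "unrelated": {"size": "xl"}}

state_plan = plan_with_state(desired, recorded)
facts_plan = plan_with_live_facts(desired, lambda: live)
```

In this toy model both planners agree on what to create and update; only the state-based one plans to remove `cache`, because it remembers having managed it.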

https://redd.it/1lnx00o
@r_devops
Cloud SIEM

Irrespective of the costs associated with the tools, why would you choose any other Cloud SIEM tool over Datadog's Cloud SIEM?

https://redd.it/1lnyuy8
@r_devops
Best Practices for Prompt Testing — learned from companies like Anthropic and OpenAI

Hey everyone! 👋

After months of research and talking to AI teams at top companies, we've compiled everything we've learned about building robust testing frameworks for LLM applications into one comprehensive guide.
What's covered:

🔬LLM-as-a-Judge evaluation - How to scale quality assessment beyond manual review (with detailed implementation strategies)

📈 Statistical significance testing - Proper hypothesis testing for prompt comparisons (because gut feelings don't cut it in production)

🎯 Comprehensive test set design - Coverage strategies that actually catch edge cases before users do

Advanced techniques - Adversarial testing, performance testing, and production monitoring
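
As an illustration of the statistical significance point above, here is a minimal Python sketch of a two-proportion z-test comparing the pass rates of two prompt variants on a test set. This is an assumption about what such a comparison could look like, not the method from the linked guide; the function name and the sample counts are hypothetical.

```python
import math


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Two-sided z-test for the difference between two pass rates.

    Returns (z, p_value). Uses the pooled-proportion standard error,
    which is the usual choice under the null hypothesis of equal rates.
    """
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both variants passed (or failed) every case: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: prompt A passes 70 of 100 cases, prompt B passes 85 of 100.
z, p_value = two_proportion_z_test(70, 100, 85, 100)
is_significant = p_value < 0.05
```

The normal approximation is only reasonable for fairly large test sets; with a few dozen cases an exact test or a bootstrap would be a safer choice.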

Key insights from the research:

• Systematic prompt evaluation can improve model performance by 40-60%

• Failure rates can be reduced by up to 80% with proper testing

• Most teams are still winging it with manual spot-checks (don't be most teams)

Why this matters: As LLMs move from demos to production systems handling real user traffic, the "move fast and break things" approach becomes... problematic. The companies that are winning are the ones treating prompt engineering like actual engineering.

The guide includes real implementation examples, statistical analysis methods, and a practical roadmap for getting started (even if you're currently doing zero testing).

Link: https://usebanyan.com/news/prompt-testing-best-practices

Would love to hear about your experiences with prompt testing - what's worked, what hasn't, and what challenges you're facing. Always looking to learn from the community!

— The Banyan Team 🌳

https://redd.it/1lo25df
@r_devops
Python learning path

Hey guys, I've wanted to learn Python for quite a while now. Could someone please suggest any useful resources? I have worked with Python a bit, tweaking code here and there.
Could someone please share a course that they have found useful?
Also, is it worth putting in the learning effort, especially now that AI is there?

https://redd.it/1lo31ki
@r_devops
Certified Kubernetes Administrator (CKA) Exam Guide - V1.32 (2025)

Your ultimate resource for acing the CKA exam on your first attempt! This repo offers detailed explanations, hands-on labs, and essential study materials, empowering aspiring Kubernetes administrators to master their skills and achieve certification success. Unlock your Kubernetes potential today!

https://github.com/techwithmohamed/CKA-Certified-Kubernetes-Administrator



https://redd.it/1lo3aba
@r_devops
Got Amazon Devops 2 interview in a few days!

I have an Amazon DevOps 2 interview in a few days! Could someone please help me with what to prepare and what types of questions I can expect in the interview? Thank you

https://redd.it/1lo4p8n
@r_devops
I'm getting an error after certificate renewal please help

Hello,
My Kubernetes cluster was running smoothly until I tried to renew the certificates after they expired. I ran the following commands:

>sudo kubeadm certs renew all

>echo 'export KUBECONFIG=/etc/kubernetes/admin.conf' >> ~/.bashrc

>source ~/.bashrc


After that, some abnormalities started to appear in my cluster. Calico is completely down and even after deleting and reinstalling it, it does not come back up at all.

When I check the daemonsets and deployments in the kube-system namespace, I see:

>kubectl get daemonset -n kube-system

>NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE

>calico-node   0         0         0       0            0           kubernetes.io/os=linux   4m4s

>

>kubectl get deployments -n kube-system

>NAME                      READY   UP-TO-DATE   AVAILABLE   AGE

>calico-kube-controllers   0/1     0            0           4m19s


Before this, I was also getting "unauthorized" errors in the kubelet logs, which started after renewing the certificates. This is definitely abnormal because the pods created from deployments are not coming up and remain stuck.

There is no error message shown during deployment either. Please help.

https://redd.it/1lo52hc
@r_devops
Update: DockedUp v1.0.0 release, check the demo once !!!

Hey r/devops!

Last week I introduced **DockedUp** — a real-time, interactive terminal dashboard for managing Docker containers. Thanks so much for the support and feedback! 🙌

I’ve just pushed a big update with performance improvements, better logs, and smoother UI — plus a new demo to show it off:

**Check out the new demo GIF**

### Install via pip or pipx:
    pipx install dockedup
### or
    pip install dockedup

### Then just run:
    dockedup

#### Links:

GitHub: [github.com/anilrajrimal1/dockedup](https://github.com/anilrajrimal1/dockedup)
PyPI: pypi.org/project/dockedup

https://redd.it/1loa4j8
@r_devops
Suggestions for an innovation sprint project? What useful new concepts or tech is 'trending'?

We are planning an innovation sprint (1 week to create a demo/PoC for a green-field project, 1 week to finalise, prep slides and demonstrate) and are at the ideas stage. I had hard plans of what I wanted to use the time for which were completely trainwrecked by a late directive to fit RnD tax credits.

I'm now in a position where I am absolutely uninterested and would like some help taking back some control of this valuable time - and not get roped in as a 6th person working on a 'support hub chat bot' project.

Any suggestions for things to consider?
- Is there somewhere I can follow for good coverage of new trends and evolution in the DevOps field?
- We have AKS clusters in Azure for deployments without any tools like Kubecost implemented. Could that be a good way to brush up on my k8s/Helm knowledge and deliver something that would look good in my annual review if it manages any cost savings?

Thanks for any advice!

https://redd.it/1loa5w5
@r_devops
Is it possible to route non http traffic by DNS with Istio

My assumption is no, but maybe there’s something that would work

Let’s say I have JDBC connections for 3 databases: db1.com, db2.com, db3.com

In K8s with Istio virtual services/gateways (without multiple load balancers), is it possible for all 3 connections to listen on TCP 5432 and then route to a DB in a specific namespace?

Example (assume the LB in all 3 cases is exactly the same):

User (db1) —> LB(5432) —> namespace 1

User (db2) —> LB(5432) —> namespace 2

User (db3) —> LB(5432) —> namespace 3

My assumption is that, as this isn’t HTTP, we’d be looking at L4, meaning the DNS name would be unknown to us and not usable.

Is this correct? Is there any way to do the above for a DB TCP connection with a single LB/port but route to namespaces based on the DNS name?
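
The reasoning above can be sketched in a few lines of Python. This is a toy model, not Istio configuration; the IP address and the `what_l4_sees` helper are hypothetical. It shows that once the client has resolved the name, a plain TCP connection only carries an IP and a port, so three names pointing at the same load balancer look identical at L4.

```python
# Hypothetical DNS records: all three names resolve to the same load balancer.
DNS = {
    "db1.com": "203.0.113.10",
    "db2.com": "203.0.113.10",
    "db3.com": "203.0.113.10",
}


def what_l4_sees(hostname, port=5432):
    """The client resolves the name itself; only (IP, port) reaches the LB."""
    return (DNS[hostname], port)


seen = {name: what_l4_sees(name) for name in DNS}

# A routing table keyed on what L4 can observe has a single entry,
# so it cannot pick a different namespace per hostname.
distinct_l4_keys = set(seen.values())
```

A hostname only becomes available to the proxy if the protocol carries it in-band, for example the SNI field of a TLS handshake. Whether that applies here depends on how the database driver negotiates TLS, which the post does not say.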

https://redd.it/1lodag9
@r_devops
Good observability tooling doesn’t mean teams actually understand it

Been an engineering manager at a large org for close to three years now. We’re not exactly a “digitally native” company, but we have ~5K developers. Platform org has solid observability tooling (LGTM stack, decent golden paths).

What I keep seeing though - both in my team and across the org - is that product engineers rarely understand the nuances of the “three pillars” of observability - logs, metrics, and traces.

Not because they’re careless, but because their cognitive budget is limited. They're focused on delivering product value, and learning three completely different mental models for telemetry is a real cost.

Even with good platform support, that knowledge gap has real implications -

* Slower incident response and triage
* Platform teams needing to educate and support a lot more
* Alert fatigue and poor signal-to-noise ratios

I wrote up [some thoughts](https://open.substack.com/pub/musingsonsoftware/p/org-implications-of-contemporary?r=57p3s&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true) on why these three pillars exist (hint - it’s storage and query constraints) and what that means for teams trying to build observability maturity -

* Metrics, logs, and traces are separate because they store and query data differently.
* That separation forces dev teams to learn three mental models.
* Even with “golden path” tooling, you can’t fully outsource that cognitive load.
* We should be thinking about unified developer experience, not just unified tooling.
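
As a rough illustration of the "three mental models" point, here is a Python sketch of how one request might be represented as a metric point, a log record, and a span. The field names are simplified assumptions and do not match any particular backend's schema; `observe_request` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetricPoint:
    """Numeric sample identified by name + labels; queried by aggregation."""
    name: str
    labels: dict
    timestamp: float
    value: float


@dataclass
class LogRecord:
    """Timestamped text with optional fields; queried by search and filter."""
    timestamp: float
    level: str
    message: str
    fields: dict = field(default_factory=dict)


@dataclass
class Span:
    """One timed operation in a trace tree; queried by trace ID and parentage."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float
    end: float


def observe_request(route, status, start, end, trace_id="t1"):
    """Emit all three signal shapes for the same request."""
    span = Span(trace_id, "s1", None, f"GET {route}", start, end)
    metric = MetricPoint(
        "http_request_duration_seconds",
        {"route": route, "status": str(status)},
        end,
        end - start,
    )
    log = LogRecord(end, "INFO", f"GET {route} -> {status}", {"trace_id": trace_id})
    return metric, log, span


metric, log, span = observe_request("/cart", 200, 10.0, 10.25)
```

The same event ends up in three differently shaped records, which is one way to see why each signal needs its own storage and query model.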

Curious if others here have seen the same gap between tooling maturity and team understanding. If you have, I'm eager to understand how you address it in your orgs.

https://redd.it/1loes4q
@r_devops
VENT Seeing engineers use LLMs to generate all the code that I used to write for them is concerning

One of our engineering directors decided to spin up a new service. Within minutes, he was able to produce the scripts / terraform to bring up the infra for these services, along with the scripts to deploy them. It’s very clear that this code was written by an LLM

It’s good, clean code too.

This is all stuff that I used to do, and I am realizing that pretty soon I will no longer be needed for this set of tasks.

This leads me to wonder what types of tasks I should focus on so as not to get automated away entirely.

I'm not trying to be a luddite or an alarmist. It's great that these tools have enabled higher productivity, and honestly writing those types of scripts was never particularly fun or engaging. Just trying to stay ahead of getting eaten by the AI bear.

https://redd.it/1lodvtm
@r_devops
Monday Questions - r/DevOptimize

r/DevOptimize is taking questions on making delivery simpler and packaging. Feel free to ask here or there.

* Are your deploys more steps than "install packages; per-env config; start services"? more than 100 lines?
* Do you have separate IaC source repos or branches for each environment? Let's discuss!
* Do you have more than two or three layers in your container build?

https://redd.it/1loi1wt
@r_devops
Are we supposed to know everything?

I used to think DevOps interviews would focus on CI/CD, observability, and maybe some k8s troubleshooting.
Then came a “design a distributed key-value store” question. My brain just… rebooted.

It’s not that I didn’t know what quorum or replication meant. But I hadn’t reviewed consensus protocols since college. I fumbled the difference between consistency and availability under pressure.

That interview was a wake-up call: if you're applying to DevOps roles that lean heavy on the “dev,” you will be asked to reason through failure models, caching layers, GC behavior, or how your system handles 4x traffic spikes without falling over.

Since then, I’ve been treating system design prep like a separate skill. I watch ByteByteGo on 1.5x speed. I sketch distributed tracing pipelines in Notion. I’ve also been using Beyz coding assistant to walk through mock scenarios. The kind where you have to balance tradeoffs and justify design choices on the fly.

It’s not about memorizing Raft vs Paxos. It’s about showing that you can ask good questions, make sane decisions, and evolve your design when requirements shift. (Also, knowing when not to build a whole new infra stack just to sound smart.)

System design interviews aren't going away. But neither is your ability to improve. Anyone else trying to "relearn" distributed systems after years of just... shipping YAML?

https://redd.it/1loj6m2
@r_devops
I need a UDP load balancer that can retry on timeouts

Greetings, friends,

Recently, I've been frantically searching for a solution to my problem:

I have a system that is composed of multiple servers that receive UDP packets and send back responses.

I need a load balancer that can also retry sending the UDP packet if no response comes back to it within 3 milliseconds. I need to check for ANY response, no parsing or anything.

I know that no response is to be expected from UDP, however, unfortunately, that is exactly what I need, otherwise, I have some edge cases where I no longer have 100% availability.

So far, I'm using Envoy Proxy, however, it does not support such a functionality for UDP.

I looked into potentially extending Envoy proxy, to create a custom UDP filter with these retries, however, it seems to be a pretty daunting task.

I couldn't even compile Envoy to begin with. It took 4 hours and ended in an error.

Does anyone know of any solution that could help achieve this? A LOT of traffic needs to be handled.
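
For illustration only, the retry behaviour described above can be sketched in Python with the standard `socket` module. This is not an Envoy filter and would not handle the traffic volume mentioned; `send_with_retry`, the round-robin order, and the attempt limit are assumptions. It treats any datagram received within the timeout as a response, without parsing it.

```python
import socket


def send_with_retry(payload, backends, timeout_s=0.003, max_attempts=3):
    """Send `payload` over UDP, retrying on the next backend after a timeout.

    `backends` is a list of (host, port) tuples tried in round-robin order.
    Returns the first response received, or None if every attempt timed out.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        for attempt in range(max_attempts):
            backend = backends[attempt % len(backends)]
            try:
                sock.sendto(payload, backend)
                # Any datagram counts as a response; contents are not inspected.
                data, _addr = sock.recvfrom(65535)
                return data
            except (socket.timeout, ConnectionError):
                # No reply in time (or the OS reported the port as closed):
                # move on to the next backend.
                continue
    return None
```

A call might look like `send_with_retry(b"ping", [("10.0.0.1", 9000), ("10.0.0.2", 9000)])`, where the addresses are placeholders. Timer precision at 3 milliseconds depends on the OS scheduler, and a late reply from an earlier attempt would be accepted as the answer to a later one, so a production design would need request IDs or per-attempt sockets.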



https://redd.it/1loix24
@r_devops
Snyk free plan limits

Hi there,

I'm currently using Snyk on a private GitHub repository integrated with my GitHub Actions pipeline. Although I've exceeded the usage limits of the free plan by quite a bit, everything still seems to be working without issue.

Does anyone know why that might be the case? Should I expect the scans to stop working suddenly, or is there typically some buffer or grace period before enforcement?

Thanks in advance!

https://redd.it/1lohq5q
@r_devops
Deploying OpenStack on Azure VMs — Common Practice or Overkill?

Hey everyone,

I recently started my internship as a junior cloud architect, and I’ve been assigned a pretty interesting (and slightly overwhelming) task:
Set up a private cloud using OpenStack, but hosted entirely on Azure virtual machines.

Before I dive in too deep, I wanted to ask the community a few important questions:

1. Is this a common or realistic approach?
Using OpenStack on public cloud infrastructure like Azure feels a bit counterintuitive to me. Have you seen this done in production, or is it mainly used for learning/labs?


2. Does it help reduce costs, or can it end up being more expensive than using Azure-native services or even on-premise servers?


3. How complex is this setup in terms of architecture, networking, maintenance, and troubleshooting?
Any specific challenges I should be prepared for?


4. What are the best practices when deploying OpenStack in a public cloud environment like Azure? (e.g., VM sizing, network setup, high availability, storage options…)


5. Is OpenStack-Ansible a good fit for this scenario, or should I consider other deployment tools like Kolla-Ansible or DevStack?


6. Are there security implications I should be especially careful about when layering OpenStack over Azure?


7. If anyone has tried this before — what lessons did you learn the hard way?



If you’ve got any recommendations, links, or even personal experiences, I’d really appreciate it. I'm here to learn and avoid as many beginner mistakes as possible 😅

Thanks a lot in advance!

https://redd.it/1lol38q
@r_devops
GitHub action failing - Cannot read password despite clearly seeing it as GITHUB_TOKEN

Hey guys,


Technical question here:

I am having an error where my GITHUB_TOKEN is being seen. [Tested by adding 'echo "${#GITHUB_TOKEN}"'; the pound symbol outputs the length, obviously not the actual token.]

yet I am getting 'err: fatal: could not read Password for 'https://***@github.com': ' in my GitHub action logs when trying to run git pull.

git pull https://${GITHUB_TOKEN}@github.com/x/x.git main

Banging my head against this for the past three hours. Below is how I grab the GITHUB_TOKEN.



    on:
      push:
        branches: [ main ]
    jobs:
      deploy:
        runs-on: ubuntu-latest

        steps:
          - name: Checkout code
            uses: actions/checkout@v4
          - name: Deploy to server
            uses: appleboy/[email protected]
            env:
              GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
            with:
              host: ${{ secrets.HOST }}
              username: ${{ secrets.USERNAME }}
              key: ${{ secrets.SSH_PRIVATE_KEY }}
              port: ${{ secrets.PORT || 22 }}
              envs: GITHUB_TOKEN
              script: |

Thank you!


Mike
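
A possible explanation for the error message, offered as an assumption rather than a confirmed diagnosis of this setup: in a URL of the form `https://TOKEN@host/...`, the token sits in the username position and no password is present, so git may fall back to prompting for one. The Python sketch below only parses the two URL shapes to show the difference; the token value is a placeholder and `credentials_in` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def credentials_in(url):
    """Return the (username, password) pair embedded in a URL, if any."""
    parts = urlsplit(url)
    return parts.username, parts.password


token = "ghs_example"  # placeholder, not a real token

# Shape used in the post: the token is parsed as the username, no password.
token_as_user = credentials_in(f"https://{token}@github.com/x/x.git")

# Shape with an explicit username, so the token lands in the password slot.
token_as_password = credentials_in(f"https://x-access-token:{token}@github.com/x/x.git")
```

With the second shape, the equivalent command would be `git pull https://x-access-token:${GITHUB_TOKEN}@github.com/x/x.git main`. This does not rule out other causes, such as the token lacking access to the repository being pulled.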









https://redd.it/1loje9q
@r_devops
How does your company define DevOps, SRE, and Platform Teams?

For context: I’ve been a software engineer for 20 years and got into DevOps over a decade ago. I’ve held a variety of roles since then, and one thing I’ve noticed is that every company seems to structure the “ops” side of the house differently. I’m curious how other companies approach it.

At my current company, here’s how things are set up:

* **DevOps Team**: Owns cloud infrastructure, manages our CDK setup and CI/CD pipelines, and has a grab bag of other responsibilities.
* **SRE Team**: Functions more like a traditional NOC, handling day-to-day server support and managing observability. There's some overlap with the DevOps team, and the boundaries aren't always clear.
* **Platform Team**: Software engineers focused on building internal tools to support development and QA.

I’m still relatively new here, and the structure feels a bit unusual especially compared to the model laid out in Google’s SRE book. I’d love to hear how other companies are organizing things.

https://redd.it/1lomymp
@r_devops
Another team took my work to corporate leadership and now they're "leading" a global rollout while I'm cast to the shadows. I had zero knowledge of this until they failed to reverse-engineer and contacted me.

Let me start by saying I’m early in my career, a year into this corporate job at a "billion-dollar" multinational company. I fully understand that any work I do while employed is legally the company's intellectual property. That said, this post is more about how I can take advantage of my contributions for my career rather than being brushed aside.

Long story short, I single-handedly modernized a legacy system used in my region, automated several processes and deployments, migrated infra to the cloud, introduced GitOps and proper CI/CD pipelines, and implemented monitoring dashboards with Prometheus+Grafana. This overhaul gained a lot of traction, so much so that a team from another region requested I build the same system for them, tailored to their needs.

Now here’s where things got interesting. Apparently, while in conversations with this other region, someone higher up at the global level got access to my project and showed it to their boss who is just one level below the CEO. I still have no idea who this person is or how they even gained access to my work. Anyways, this corporate leader was so impressed that they decided the system should be rolled out globally as soon as possible. The person who shared my project then took it upon themselves to assign a team dedicated to replicating it for all regions.


Now this assigned team somehow managed to access my project (I genuinely suspect a security breach or admin-level involvement) and tried to reverse-engineer everything I built.. but failed. They then began trying to identify who was behind the project and eventually contacted my manager (the "official" project manager) by pulling him into a meeting without prior notice. Odd.

So my manager then decided to set up a proper call with this team, with me involved this time. In this call, they basically came forward and asked us to provide all the code, tools, and cloud infrastructure so they can simply copy and paste it for all regions, and also requested several technical sessions. To make matters worse, they want me to handle all the IT bureaucratic processes for every region to get things set up (I can already see myself being roped into supporting all regions and not just my own at this point). However, I strongly believe this "replication" approach is destined to fail, as each region has different user requirements and processes not quite comparable to ours. I also strongly believe they will struggle to get anything running, given their limited technical and business knowledge of the processes and the type of technical questions I was being asked.


Anyways, if this team rolls out my solution globally for each region, they’ll receive all the visibility and credit (they'll be hosting demo sessions with region leaders, which for sure I won't be invited to), while I'll be essentially cast into the shadows. What’s frustrating is that I have full knowledge of the system and am responsible for it, so why isn't my manager at least the one leading this global rollout rather than some random team?

I’ve been trying to indirectly nudge my manager to take ownership of the global rollout, instead of letting this new team take over. But I’m not sure how this will play out. The person who assigned this team is closer to the corporate leader, while my manager is a few steps lower in the hierarchy. So far, all he’s done is try to keep our regional manager informed of the situation playing out. Realistically, only the regional manager can mention this to the corporate leader, but I’m not confident that will happen.

My manager often says "how will this benefit the team?" But in this case, it’s clear he’s struggling to see any benefit in simply handing over our work to another team that will walk away with all the credit.

We’re still in the early stages, and I haven’t handed anything over yet. But I’m deeply