Reddit DevOps
266 subscribers
30.9K links
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
PSA: Crossplane API version migrations can completely brick your cluster (and how I survived it)

Just spent 4 hours recovering from what started as an "innocent" Lambda Permission commit. Thought this might save someone else's Thursday.

What happened: Someone committed a Crossplane resource using `lambda.aws.upbound.io/v1beta1`, but our cluster expected v1beta2. The conversion webhook failed because the loggingConfig field format changed from a map to an array between versions.

The death spiral:

Error: conversion webhook failed: cannot convert from spoke version "v1beta1" to hub version "v1beta2":
value at field path loggingConfig must be []any, not "map[string]interface {}"

This error completely locked us out of ALL Lambda Function resources:

* `kubectl get functions` → webhook error
* `kubectl delete functions` → webhook error
* Raw API calls → still blocked
* ArgoCD stuck in a permanent Unknown state

Standard troubleshooting that DIDN'T work:

* Disabling validating webhooks
* Hard-refreshing ArgoCD
* Patching resources directly
* Restarting provider pods

What finally worked (nuclear option):

```bash
# Delete the entire CRD - this removes ALL Lambda Function resources
kubectl delete crd functions.lambda.aws.upbound.io --force --grace-period=0

# Wait for the Crossplane provider to recreate the CRD
kubectl get pods -n crossplane-system

# Update your manifests to v1beta2 and fix the loggingConfig format:
# OLD (v1beta1, map):  loggingConfig: { applicationLogLevel: INFO }
# NEW (v1beta2, list): loggingConfig: [ { applicationLogLevel: INFO } ]

# Then sync everything back
```
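If many manifests need the step-3 rewrite, it can be scripted. A toy sketch, assuming the v1beta2 format is a list as the error suggests; the file name and field layout here are invented for illustration:

```shell
# Create a sample v1beta1 manifest (stand-in for your real repo files).
cat > function.yaml <<'EOF'
apiVersion: lambda.aws.upbound.io/v1beta1
kind: Function
spec:
  forProvider:
    loggingConfig:
      applicationLogLevel: INFO
EOF

# Bump the apiVersion and turn the map entry into a list item.
sed -i \
  -e 's|lambda\.aws\.upbound\.io/v1beta1|lambda.aws.upbound.io/v1beta2|' \
  -e 's|^\([[:space:]]*\)applicationLogLevel:|\1- applicationLogLevel:|' \
  function.yaml

cat function.yaml
```

Run it over a glob of manifests before re-syncing; for anything less regular than this, a proper YAML tool beats `sed`.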

Key lesson: When Crossplane conversion webhooks fail, they can create a catch-22 where you can't access resources to fix them, but you can't fix them without accessing them. Sometimes nuking the CRD is the only way out.

Anyone else hit this webhook deadlock? What was your escape route?

Edit: For the full play-by-play of this disaster, I wrote it up here if you're into technical war stories.

https://redd.it/1lnor51
@r_devops
Can you cut observability bill by 50% with an eBPF-first stack?

Datadog costs. **A lot.**

Companies are paying more for telemetry than some production workloads. I’ve been researching how SaaS teams are quietly cutting 30–70% of their observability costs by replacing per-host agents with kernel-native tooling.

Companies like [EX.CO](https://EX.CO) and open-source adopters using [SigNoz](https://signoz.io/) are moving away from Datadog + CloudWatch and adopting **eBPF-first architectures** that are leaner, faster, and significantly cheaper.

# Stack shift

**Replace:**
• Datadog APM
• CloudWatch Logs
• CloudWatch Metrics

**With:**
• Cilium + Hubble (network flows)
• Pixie + Parca (profiling/traces)
• ClickHouse or Iceberg (raw storage)

**Result:**
• Zero sidecars
• < 1% CPU overhead
• Usage-based pipelines instead of per-host licenses

# Key takeaways

* eBPF probes run once per node → < 1% CPU, zero sidecars
* Usage-based pipelines (ClickHouse / Iceberg) beat per-host licenses
* Removing duplicate log streams saved another 40% ingest

# 6-week roadmap & KPIs

1. **Deploy Cilium/Hubble** in a non-prod cluster; export to ClickHouse or S3. *Target: < 1 % node overhead*
2. **Enable eBPF profiling** (Pixie/Parca); compare to language agents. *Target: span parity*
3. **Shadow live traffic**; validate SLOs. *Target: < 2 % trace drop*
4. **Disable Datadog log ingest** for eBPF-covered namespaces. *Target: GB/day ↓ 40 %*
5. **Remove per-pod agents**; right-size node groups. *Target: CPU-hrs ↓*
6. **Pipe trimmed streams** to Iceberg / Redshift streaming for long-term ML/BI. *Target: $/GB storage ↓ 80 %*
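To make step 1 concrete, a minimal Hubble-enabled Cilium install might look like this. The Helm values below are my best guess at a lean setup, not a vetted config; verify them against the Cilium chart docs:

```shell
# Write candidate Helm values for a Hubble-enabled Cilium install.
cat > cilium-values.yaml <<'EOF'
hubble:
  enabled: true
  relay:
    enabled: true
  metrics:
    enabled:
      - dns
      - drop
      - tcp
      - flow
EOF

# Then, against a non-prod cluster:
#   helm repo add cilium https://helm.cilium.io
#   helm install cilium cilium/cilium -n kube-system -f cilium-values.yaml
```

From there, point Hubble's flow export at your ClickHouse/S3 sink and measure node overhead against the < 1% target.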

https://redd.it/1lnrr6i
@r_devops
what else?

RHCSA + K8s + AWS Cloud Practitioner & SysOps + Azure AZ-900 + Terraform + Ansible + Git + Docker.
What should I do next? I'm still a fresh graduate looking for a job. Any advice? What about working remotely?

https://redd.it/1lnu65o
@r_devops
Octopus Deploy Reviews... What's your feedback?

I'm curious about Octopus Deploy in practical DevOps settings... It seems to have great ratings, especially for integration and support. While it gets praise for customizable steps and its UI, I’ve seen mentions of permissions headaches. If you've used it, what do you think: love it or hate it? How does it handle complex scaling? Any quirks I should know about? And with all the options out there, is it still worth using in 2025? Looking forward to this community's takes. I've gotten a ton of value as a lurker. Thanks in advance...

https://redd.it/1lnu62b
@r_devops
Ansible vs Terraform for idempotency?

This post assumes all of us are familiar with these two tools for infrastructure provisioning and configuration. This has been bugging me for a while. The shop I’m at has a hybrid cloud setup, and using both of these tools I’m finding that Terraform is slowly becoming redundant. Both tools are sold on their idempotency for provisioning and configuration.

Terraform handles idempotency using state files in a persistent data store.

Ansible handles idempotency by “gathering facts” in memory, avoiding drift.

Pardon my ignorance, as this might have been asked from another angle in this sub. But why would I choose Terraform over Ansible for infrastructure provisioning at this point, with the hassle of handling persistent state files, when I can just do a dry run of Ansible to see the state of my infrastructure, all handled in memory?
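For reference, the two dry-run idioms side by side (a sketch; `site.yml` and the Terraform working directory are hypothetical):

```shell
# Terraform: diffs desired config against the persisted state file plus a live
# refresh of the real APIs; exit code 2 means changes are pending.
terraform plan -detailed-exitcode || true

# Ansible: re-runs the playbook in check mode; each module re-queries the
# target system, so the "state" is rebuilt from gathered facts on every run.
ansible-playbook site.yml --check --diff || true
```

The trade-off usually cited: Terraform's state file lets it plan deletions of resources that no longer appear in config, which a purely fact-gathering run can't see.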

https://redd.it/1lnx00o
@r_devops
Cloud SIEM

Irrespective of the costs associated with the tools, why would you choose any other Cloud SIEM tool over Datadog's Cloud SIEM?

https://redd.it/1lnyuy8
@r_devops
Best Practices for Prompt Testing — learned from companies like Anthropic and OpenAI

Hey everyone! 👋

After months of research and talking to AI teams at top companies, we've compiled everything we've learned about building robust testing frameworks for LLM applications into one comprehensive guide.
What's covered:

🔬LLM-as-a-Judge evaluation - How to scale quality assessment beyond manual review (with detailed implementation strategies)

📈 Statistical significance testing - Proper hypothesis testing for prompt comparisons (because gut feelings don't cut it in production)

🎯 Comprehensive test set design - Coverage strategies that actually catch edge cases before users do

Advanced techniques - Adversarial testing, performance testing, and production monitoring

Key insights from the research:

• Systematic prompt evaluation can improve model performance by 40-60%

• Failure rates can be reduced by up to 80% with proper testing

• Most teams are still winging it with manual spot-checks (don't be most teams)

Why this matters: As LLMs move from demos to production systems handling real user traffic, the "move fast and break things" approach becomes... problematic. The companies that are winning are the ones treating prompt engineering like actual engineering.

The guide includes real implementation examples, statistical analysis methods, and a practical roadmap for getting started (even if you're currently doing zero testing).

Link: https://usebanyan.com/news/prompt-testing-best-practices

Would love to hear about your experiences with prompt testing - what's worked, what hasn't, and what challenges you're facing. Always looking to learn from the community!

— The Banyan Team 🌳

https://redd.it/1lo25df
@r_devops
Python learning path

Hey guys, I've wanted to learn Python for quite a while now. Could someone please suggest any resources that are useful? I have worked with Python a bit, tweaking code here and there.
Could someone please share a course that they have found useful?
Also, is it worth putting in the learning effort, especially with AI around?

https://redd.it/1lo31ki
@r_devops
Certified Kubernetes Administrator (CKA) Exam Guide - V1.32 (2025)

Your ultimate resource for acing the CKA exam on your first attempt! This repo offers detailed explanations, hands-on labs, and essential study materials, empowering aspiring Kubernetes administrators to master their skills and achieve certification success. Unlock your Kubernetes potential today!

https://github.com/techwithmohamed/CKA-Certified-Kubernetes-Administrator



https://redd.it/1lo3aba
@r_devops
Got an Amazon DevOps 2 interview in a few days!

Got an Amazon DevOps 2 interview in a few days! Please, if someone can help me with what to prepare and what type of questions I can expect in the interview. Thank you

https://redd.it/1lo4p8n
@r_devops
I'm getting an error after certificate renewal please help

Hello,
My Kubernetes cluster was running smoothly until I tried to renew the certificates after they expired. I ran the following commands:

    sudo kubeadm certs renew all
    echo 'export KUBECONFIG=/etc/kubernetes/admin.conf' >> ~/.bashrc
    source ~/.bashrc


After that, some abnormalities started to appear in my cluster. Calico is completely down and even after deleting and reinstalling it, it does not come back up at all.

When I check the daemonsets and deployments in the kube-system namespace, I see:

    kubectl get daemonset -n kube-system
    NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
    calico-node   0         0         0       0            0           kubernetes.io/os=linux   4m4s

    kubectl get deployments -n kube-system
    NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
    calico-kube-controllers   0/1     0            0           4m19s


Before this, I was also getting "unauthorized" errors in the kubelet logs, which started after renewing the certificates. This is definitely abnormal because the pods created from deployments are not coming up and remain stuck.

There is no error message shown during deployment either. Please help.
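Not a definitive fix, but the usual follow-up steps after `kubeadm certs renew all` are worth ruling out. This sketch only writes the checklist to a script so it can be reviewed before running on the control-plane node; paths are kubeadm defaults, and step 2 only applies if your kubelet.conf embeds a client certificate:

```shell
cat > cert-renew-followup.sh <<'EOF'
#!/bin/sh
set -eu
# 1. Renewed certs are only picked up on restart: bounce the static control-plane
#    pods (apiserver, controller-manager, scheduler, etcd) by moving their
#    manifests out of the kubelet's watched directory and back.
mkdir -p /etc/kubernetes/manifests-stopped
mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests-stopped/
sleep 30
mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/

# 2. If kubelet.conf still references an expired client cert, regenerate it and
#    restart the kubelet (a common source of "Unauthorized" kubelet logs).
kubeadm init phase kubeconfig kubelet
systemctl restart kubelet

# 3. Use the refreshed admin kubeconfig for kubectl.
mkdir -p "$HOME/.kube"
cp /etc/kubernetes/admin.conf "$HOME/.kube/config"
EOF
chmod +x cert-renew-followup.sh
```

With the API server and kubelet trusting the new certs again, Calico's pods usually schedule normally on the next reinstall.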

https://redd.it/1lo52hc
@r_devops
Update: DockedUp v1.0.0 release, check the demo once !!!

Hey r/devops!

Last week I introduced **DockedUp** — a real-time, interactive terminal dashboard for managing Docker containers. Thanks so much for the support and feedback! 🙌

I’ve just pushed a big update with performance improvements, better logs, and smoother UI — plus a new demo to show it off:

**Check out the new demo GIF**

### Install via pip or pipx:

    pipx install dockedup
    # or
    pip install dockedup

### Then just run:

    dockedup

#### Links:

GitHub: [github.com/anilrajrimal1/dockedup](https://github.com/anilrajrimal1/dockedup)
PyPI: [pypi.org/project/dockedup](https://pypi.org/project/dockedup)

https://redd.it/1loa4j8
@r_devops
Suggestions for an innovation sprint project? What useful new concepts or tech is 'trending'?

We are planning an innovation sprint (1 week to create a demo/PoC for a green-field project, 1 week to finalise, prep slides, and demonstrate) and are at the ideas stage. I had firm plans for what I wanted to use the time for, which were completely trainwrecked by a late directive to fit R&D tax credits.

I'm now in a position where I am absolutely uninterested and would like some help taking back some control of this valuable time - and not get roped in as a 6th person working on a 'support hub chat bot' project.

Any suggestions for things to consider?
- Is there somewhere I can follow for good coverage of new trends and evolution in the DevOps field?
- We have AKS clusters in Azure for deployments, without any tools like Kubecost implemented. This could be a good way to brush up on my k8s/Helm knowledge and deliver something that would look good in my annual review if it achieves any cost savings.

Thanks for any advice!

https://redd.it/1loa5w5
@r_devops
Is it possible to route non http traffic by DNS with Istio

My assumption is no, but maybe there’s something that would work

Let’s say I have a JDBC connection for 3 databases db1.com, db2.com, db3.com

In K8s with Istio virtual services/gateways (without multiple load balancers), is it possible for all 3 connections to listen on TCP 5432 and then route to a DB in a specific namespace?

Example (assume the LB is the exact same for all 3):

User (db1) —> LB(5432) —> namespace 1

User (db2) —> LB(5432) —> namespace 2

User (db3) —> LB(5432) —> namespace 3

My assumption is that since this isn’t HTTP, we’d be looking at L4, meaning the DNS name would be unknown to us/not usable.

Is this correct? Is there any way to do the above for a DB TCP connection with a single LB/port, but route to namespaces based on the DNS name?
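For plain TCP the assumption holds: at L4 nothing in the packets carries the DNS name, so one LB/port cannot tell db1 from db2. The usual escape hatch is TLS with SNI: if the client opens TLS directly on 5432 and sends the hostname in the ClientHello, Istio can route on it with a passthrough Gateway. A hedged sketch (hostnames and service names are from the example above; note the Postgres caveat in the comments):

```shell
# Caveat: classic Postgres negotiates TLS only after a cleartext SSLRequest, so
# the gateway never sees a ClientHello. This only works with direct TLS
# negotiation (supported by newer clients) or a TLS-fronting proxy.
cat > db-sni-routing.yaml <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: db-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 5432
      name: tls-db
      protocol: TLS
    tls:
      mode: PASSTHROUGH
    hosts:
    - db1.com
    - db2.com
    - db3.com
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: db1-route
  namespace: istio-system
spec:
  hosts:
  - db1.com
  gateways:
  - db-gateway
  tls:
  - match:
    - port: 5432
      sniHosts:
      - db1.com
    route:
    - destination:
        host: db1.namespace-1.svc.cluster.local
        port:
          number: 5432
EOF
# kubectl apply -f db-sni-routing.yaml   # repeat the VirtualService per DB
```

Without TLS+SNI, the fallback is one port per database on the same LB, routed with plain `tcp` matches.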

https://redd.it/1lodag9
@r_devops
Good observability tooling doesn’t mean teams actually understand it

Been an engineering manager at a large org for close to three years now. We’re not exactly a “digitally native” company, but we have ~5K developers. Platform org has solid observability tooling (LGTM stack, decent golden paths).

What I keep seeing though - both in my team and across the org - is that product engineers rarely understand the nuances of the “three pillars” of observability - logs, metrics, and traces.

Not because they’re careless, but because their cognitive budget is limited. They're focused on delivering product value, and learning three completely different mental models for telemetry is a real cost.

Even with good platform support, that knowledge gap has real implications -

* Slower incident response and triage
* Platform teams needing to educate and support a lot more
* Alert fatigue and poor signal-to-noise ratios

I wrote up [some thoughts](https://open.substack.com/pub/musingsonsoftware/p/org-implications-of-contemporary?r=57p3s&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true) on why these three pillars exist (hint - it’s storage and query constraints) and what that means for teams trying to build observability maturity -

* Metrics, logs, and traces are separate because they store and query data differently.
* That separation forces dev teams to learn three mental models.
* Even with “golden path” tooling, you can’t fully outsource that cognitive load.
* We should be thinking about unified developer experience, not just unified tooling.

Curious if others here have seen the same gap between tooling maturity and team understanding and if you do I'm eager to understand how you address it in your orgs.

https://redd.it/1loes4q
@r_devops
[VENT] Seeing engineers use LLMs to generate all the code that I used to write for them is concerning

One of our engineering directors decided to spin up a new service. Within minutes, he was able to produce the scripts / terraform to bring up the infra for these services, along with the scripts to deploy them. It’s very clear that this code was written by an LLM

It’s good, clean code too.

This is all stuff that I used to do, and I am realizing that pretty soon I will no longer be needed for this set of tasks.

This leads me to wonder what types of tasks I should focus on so as not to get automated away entirely.

I'm not trying to be a luddite or an alarmist. It's great that these tools have enabled higher productivity, and honestly writing those types of scripts was never particularly fun or engaging. Just trying to stay ahead of getting eaten by the AI bear.

https://redd.it/1lodvtm
@r_devops
Monday Questions - r/DevOptimize

r/DevOptimize is taking questions on making delivery simpler and packaging. Feel free to ask here or there.

* Are your deploys more steps than "install packages; per-env config; start services"? more than 100 lines?
* Do you have separate IaC source repos or branches for each environment? Let's discuss!
* Do you have more than two or three layers in your container build?

https://redd.it/1loi1wt
@r_devops
Are we supposed to know everything?

I used to think DevOps interviews would focus on CI/CD, observability, and maybe some k8s troubleshooting.
Then came a “design a distributed key-value store” question. My brain just… rebooted.

It’s not that I didn’t know what quorum or replication meant. But I hadn’t reviewed consensus protocols since college. I fumbled the difference between consistency and availability under pressure.

That interview was a wake-up call: if you're applying to DevOps roles that lean heavy on the “dev,” you will be asked to reason through failure models, caching layers, GC behavior, or how your system handles 4x traffic spikes without falling over.

Since then, I’ve been treating system design prep like a separate skill. I watch ByteByteGo on 1.5x speed. I sketch distributed tracing pipelines in Notion. I’ve also been using Beyz coding assistant to walk through mock scenarios. The kind where you have to balance tradeoffs and justify design choices on the fly.

It’s not about memorizing Raft vs Paxos. It’s about showing that you can ask good questions, make sane decisions, and evolve your design when requirements shift. (Also, knowing when not to build a whole new infra stack just to sound smart.)

System design interviews aren't going away. But neither is your ability to improve. Anyone else trying to "relearn" distributed systems after years of just... shipping YAML?

https://redd.it/1loj6m2
@r_devops
I need a UDP load balancer that can retry on timeouts

Greetings, friends,

Recently, I've been frantically searching for a solution to my problem:

I have a system that is composed of multiple servers that receive UDP packets and send back responses.

I need a load balancer that can also retry sending the UDP packet if no response comes back to it within 3 milliseconds. I need to check for ANY response, no parsing or anything.

I know that no response is to be expected from UDP, however, unfortunately, that is exactly what I need, otherwise, I have some edge cases where I no longer have 100% availability.

So far, I'm using Envoy Proxy, however, it does not support such a functionality for UDP.

I looked into potentially extending Envoy proxy, to create a custom UDP filter with these retries, however, it seems to be a pretty daunting task.

I couldn't even compile Envoy to begin with. It took 4 hours and ended in an error.

Does anyone know of any solution that could help achieve this? A LOT of traffic needs to be handled.
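One candidate worth testing is nginx's stream module, which can expect a fixed number of reply datagrams per session and treat silence as a failed session. Whether a `proxy_timeout` expiry actually triggers a retry on the next upstream at 3 ms scales is something I'd verify with a load test; this writes a sketch config, not a recommendation, and the backend addresses are placeholders:

```shell
cat > udp-lb.conf <<'EOF'
stream {
    upstream udp_backends {
        server 10.0.0.1:9000;
        server 10.0.0.2:9000;
    }
    server {
        listen 9000 udp;
        proxy_pass udp_backends;
        proxy_responses 1;       # session is complete after one reply datagram
        proxy_timeout 3ms;       # no reply within 3ms = failed session
        proxy_next_upstream on;  # pass the datagram to the next server on failure
    }
}
EOF
# Validate and run (needs an events{} block and nginx built with --with-stream):
#   nginx -t -c "$PWD/udp-lb.conf"
```

If nginx can't hit the 3 ms budget, the same expect-one-response pattern is straightforward to implement in a small custom proxy, which may be less daunting than an Envoy UDP filter.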



https://redd.it/1loix24
@r_devops
Snyk free plan limits

Hi there,

I'm currently using Snyk on a private GitHub repository integrated with my GitHub Actions pipeline. Although I've exceeded the usage limits of the free plan by quite a bit, everything still seems to be working without issue.

Does anyone know why that might be the case? Should I expect the scans to stop working suddenly, or is there typically some buffer or grace period before enforcement?

Thanks in advance!

https://redd.it/1lohq5q
@r_devops
Deploying OpenStack on Azure VMs — Common Practice or Overkill?

Hey everyone,

I recently started my internship as a junior cloud architect, and I’ve been assigned a pretty interesting (and slightly overwhelming) task:
Set up a private cloud using OpenStack, but hosted entirely on Azure virtual machines.

Before I dive in too deep, I wanted to ask the community a few important questions:

1. Is this a common or realistic approach?
Using OpenStack on public cloud infrastructure like Azure feels a bit counterintuitive to me. Have you seen this done in production, or is it mainly used for learning/labs?


2. Does it help reduce costs, or can it end up being more expensive than using Azure-native services or even on-premise servers?


3. How complex is this setup in terms of architecture, networking, maintenance, and troubleshooting?
Any specific challenges I should be prepared for?


4. What are the best practices when deploying OpenStack in a public cloud environment like Azure? (e.g., VM sizing, network setup, high availability, storage options…)


5. Is OpenStack-Ansible a good fit for this scenario, or should I consider other deployment tools like Kolla-Ansible or DevStack?


6. Are there security implications I should be especially careful about when layering OpenStack over Azure?


7. If anyone has tried this before — what lessons did you learn the hard way?
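One concrete first check for questions 1 and 3: OpenStack compute nodes need the Azure VMs to expose CPU virtualization extensions (nested virtualization), which Azure only enables on certain sizes (the Dv3/Ev3 families and newer, as far as I know). A quick probe from inside a candidate VM:

```shell
# Count logical CPUs exposing Intel VT-x (vmx) or AMD-V (svm) flags.
vmx_count=$(grep -cE 'vmx|svm' /proc/cpuinfo || true)
if [ "${vmx_count:-0}" -gt 0 ]; then
  echo "nested virt OK: vmx/svm present on ${vmx_count} logical CPUs"
else
  echo "no vmx/svm flags: pick an Azure size that supports nested virtualization"
fi
```

Without those flags, Nova falls back to QEMU software emulation, which is usually too slow for anything beyond a toy lab.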



If you’ve got any recommendations, links, or even personal experiences, I’d really appreciate it. I'm here to learn and avoid as many beginner mistakes as possible 😅

Thanks a lot in advance!

https://redd.it/1lol38q
@r_devops