Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Anyone with experience comparing AWS and Oracle Cloud

Hello!
My team and I are currently exploring the possibility of switching from AWS to Oracle Cloud (OCI), and we have a few questions. We're specifically trying to compare the following services:

EKS (AWS) vs OKE (OCI) for Kubernetes
EC2 vs OCI Compute
AWS Load Balancers vs OCI Load Balancer

We're especially interested in hearing about:

Differences in performance and cost
Ease of setup and day-to-day management
Integration with other cloud services like IAM, autoscaling, monitoring, etc.
Data transfer costs – this is a big concern for us. AWS charges for most outbound traffic, while OCI offers a free monthly bandwidth quota (like 10TB, depending on region).
Any lessons learned or suggestions for switching from AWS to OCI
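On the data transfer point, a back-of-the-envelope calculator makes the gap concrete. The rates below are illustrative placeholders (AWS internet egress around $0.09/GB after a small free allowance, OCI free up to 10 TB/month), so check the current pricing pages before relying on them:

```python
# Rough egress cost comparison. The rates are illustrative placeholders;
# always verify against the current AWS and OCI pricing pages.

def aws_egress_cost(gb, free_gb=100, rate_per_gb=0.09):
    """Estimate AWS internet egress cost (single illustrative tier)."""
    return max(gb - free_gb, 0) * rate_per_gb

def oci_egress_cost(gb, free_gb=10 * 1024, rate_per_gb=0.0085):
    """Estimate OCI internet egress cost (10 TB/month free quota)."""
    return max(gb - free_gb, 0) * rate_per_gb

if __name__ == "__main__":
    monthly_gb = 5_000  # 5 TB of outbound traffic per month
    print(f"AWS: ${aws_egress_cost(monthly_gb):,.2f}")
    print(f"OCI: ${oci_egress_cost(monthly_gb):,.2f}")
```

At 5 TB/month of egress the difference is the whole bill: OCI stays inside its free quota while AWS charges for nearly all of it.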

If anyone has experience working with both platforms, we’d really appreciate your insights. Thanks in advance!

https://redd.it/1l76bhe
@r_devops
Confusion on improving DevEx with platform engineering

Hey, so today we are using Terraform across our org (a lot of copy and paste without centralized modules). We also have k8s and ArgoCD. The problem is that the process for developers to create new services and infra is not entirely smooth or clear.

We've been tasked with improving this process and making it easier and faster for developers to self-service what they need. I've been exploring whether things like Crossplane would make sense, but that has just left me even more unsure.

Any suggestions on what has worked for you guys would be appreciated. Things are so opinionated these days that I often just end up going in circles 😅

https://redd.it/1l77r04
@r_devops
Has anyone been able to programmatically grab the SHA256 file for Telegraf?

Hello,

This is a bit of a weird ask: I'm trying to fully automate updates of our telegraf service on a Windows server, but Telegraf's SHA256 file sits behind a JavaScript button for some reason.

Has anyone been able to automate the download & verification of the newest telegraf SHA file? I've mostly got it, but the SHA file sitting behind a weird JS component is the one hitch in my steps.
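Not a Telegraf-specific answer, but `.sha256` files are usually plain text (`<hex hash>  <filename>`), so once you know the direct artifact URL you can skip the JS button entirely. A sketch, assuming the `dl.influxdata.com` release URL pattern matches the public download links (verify against the actual downloads page first) and a hypothetical version number:

```python
# Download-and-verify sketch for a Telegraf Windows release artifact.
# The BASE URL pattern is an assumption -- check it against the real
# download page before automating around it.
import hashlib
import urllib.request

BASE = "https://dl.influxdata.com/telegraf/releases"

def fetch(url: str) -> bytes:
    # Plain HTTP GET; the direct artifact URLs need no JavaScript
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify(artifact: bytes, sha_file_text: str) -> bool:
    # .sha256 files are typically "<hex hash>  <filename>"; take the hash
    expected = sha_file_text.split()[0].strip().lower()
    return sha256_hex(artifact) == expected

# Usage sketch (needs network; the version number is hypothetical):
#   name = "telegraf-1.30.0_windows_amd64.zip"
#   artifact = fetch(f"{BASE}/{name}")
#   assert verify(artifact, fetch(f"{BASE}/{name}.sha256").decode())
```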

https://redd.it/1l761cp
@r_devops
Rate My Idea !! A temporary app hosting service — just a resume project, not a startup

Hey everyone,

So I’ve been learning DevOps for a while now, and instead of just following tutorials or deploying sample apps, I thought of building something a bit more real-world.

The idea is pretty simple — a platform where anyone can deploy their GitHub project (frontend/backend) and host it temporarily for 1 day. After that, the app gets removed automatically.

Basically:

You give a GitHub link
Jenkins pulls it, builds it using Docker
It gets hosted on my server with a unique port or subdomain
You get the link via email
After 24 hours, the app is removed from the server
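The 24-hour cleanup step is the part most worth getting right. One way to keep it testable is to separate the "which containers are expired" decision from the actual `docker rm` call. A minimal sketch, where the creation-timestamp bookkeeping is an assumption (e.g. recorded when Jenkins starts the container):

```python
# Cleanup sketch: pure expiry logic, separated from the docker CLI call.
import subprocess
import time

MAX_AGE_SECONDS = 24 * 60 * 60

def expired(containers, now, max_age=MAX_AGE_SECONDS):
    """containers: iterable of (container_id, created_unix_ts) pairs.
    Returns the IDs that have exceeded max_age."""
    return [cid for cid, created in containers if now - created >= max_age]

def cleanup(containers):
    # Force-remove each expired container; `docker rm -f` also stops
    # running containers before removing them
    for cid in expired(containers, time.time()):
        subprocess.run(["docker", "rm", "-f", cid], check=True)
```

Keeping `expired()` pure means the scheduling piece (cron, systemd timer, or a Jenkins job) just calls `cleanup()` with the current container list.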

Only 4–5 apps will be live at a time, just to keep it manageable on my VPS. The main goal is to learn proper CI/CD, automation, container handling, cleanup scripts, and also make something that others can try out.

Not trying to launch a startup or anything — just a hands-on project to showcase on my resume and maybe help other devs who want a quick place to test or show their app.

I just want to know:

Is this idea worth building?
Any suggestions on what I can improve or add?
Anything that could go wrong or I should handle better?

Thanks in advance 🙏 Just trying to learn and build something useful for the dev community.

https://redd.it/1l7ahex
@r_devops
Anyone here tried Rafay’s GPU PaaS stack for managing AI infra?

Been seeing more mentions of Rafay's GPU PaaS push for AI workloads. Curious if anyone here has used their platform or evaluated it?

How does it stack up against SageMaker or other solutions?

https://redd.it/1l7ddl6
@r_devops
AWS Cognito authentication with Keycloak as 3rd party IdP

Hi everyone, I am not sure this is the right place to ask, but hopefully someone can lend a hand and suggest improvements to my current setup, which is a bit rigid for this use case.


So I am using AWS Cognito for authentication/authorization in our web application. But I noticed that all the users live in AWS, which is not a great way to manage them when our application already uses Keycloak as the IdP. So I decided to integrate Keycloak as an external identity provider in AWS Cognito to see how it goes. So far the integration works, and users can log in (testing mode, with the default AWS login page).



But when I inspected the user's ID token, it was missing several attributes that I need in order to put users into different groups on Cognito. I tried the pre token generation trigger with a Lambda function to add custom attributes to the ID token, but it did not work: first, the default ID token does not include the realm_role attribute needed to determine the user's role, and second, I could not add a custom claim no matter what I tried with the examples AWS provides. I am not sure if this is an actual limitation of AWS Cognito with a third-party IdP setup.
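For reference, here is a minimal sketch of what a (V1) pre token generation handler looks like when it adds a claim and overrides groups. The `custom:realm_role` attribute name and the group naming are hypothetical, and whether a federated Keycloak claim actually arrives in `userAttributes` depends on the IdP attribute mapping configured in Cognito; that mapping is the part to verify first.

```python
# Sketch of a V1 pre token generation Lambda. Assumption: the Keycloak
# realm role has been mapped to a custom Cognito attribute
# ("custom:realm_role" here is hypothetical).

def lambda_handler(event, context):
    attrs = event["request"]["userAttributes"]
    realm_role = attrs.get("custom:realm_role", "default")

    # claimsOverrideDetails is how the V1 trigger injects/overrides
    # claims and group membership in the issued tokens
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {"realm_role": realm_role},
        "groupOverrideDetails": {
            "groupsToOverride": [f"group-{realm_role}"],
        },
    }
    return event
```

If the attribute never shows up in `userAttributes`, the problem is upstream in the IdP attribute mapping, not in the Lambda.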


I am not sure if there is a direct solution to this. I have a workaround idea, but it feels convoluted: make an API call to Keycloak to fetch each user's required attributes, dump them into an S3 bucket, and then have a background job or event-driven Lambda update the users' memberships and assign them to different groups. It feels like a roundabout loop just to complete the task.
May I know if there is anyone encountering this issue before? What would be your solution?

Thank you!

https://redd.it/1l7adu8
@r_devops
Logging Failed Writes/Reads in Redis (AWS Valkey cache)

We’re encountering issues in our Valkey cache where it’s not updating sometimes. Is there a way to log the failed writes and reads? I tried checking Cloudwatch but it doesn’t have native metrics to catch these failures.
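As far as I know there is no native ElastiCache metric for individual failed commands, so one option is to log failures client-side by wrapping the cache calls. A minimal sketch, assuming a redis-py style client (which speaks the Valkey protocol):

```python
# Client-side failure logging for cache writes. `client` is assumed to be
# a redis-py style object (redis.Redis or similar).
import logging

log = logging.getLogger("cache")

def safe_set(client, key, value):
    """SET with failure logging; returns the client result, or None on error."""
    try:
        ok = client.set(key, value)
        if not ok:
            # SET can return a falsy value without raising (e.g. NX miss)
            log.error("cache SET returned falsy for key=%s", key)
        return ok
    except Exception:
        # Connection errors, timeouts, OOM rejections, etc.
        log.exception("cache SET failed for key=%s", key)
        return None
```

Ship those log lines to CloudWatch Logs and you can build a metric filter on them, which gets you the failure visibility the native metrics lack.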

https://redd.it/1l7qe65
@r_devops
A Complete Load Testing Setup with k6 and Grafana

I recently put together a modern load testing setup using k6 to run tests, and Grafana to visualise the results, with GitHub Actions for automation.

In my guide, I use Grafana Cloud's Prometheus Remote Write to keep things simple, but you can easily plug in your own self-hosted Grafana + Prometheus stack.

The setup includes:

Running k6 on a lightweight EC2 instance
Pushing metrics to the Prometheus Remote Write endpoint
Visualising test results in Grafana dashboards
Automating test runs for multiple services via GitHub Actions

It’s a DevOps-friendly, repeatable approach that works for QA and engineering teams alike.

Full guide here (with code & workflows): https://medium.com/@prateekjain.dev/modern-load-testing-for-engineering-teams-with-k6-and-grafana-4214057dff65?sk=eacfbfbff10ed7feb24b7c97a3f72a93

https://redd.it/1l7pytd
@r_devops
Claude Code under root and without Docker — permission-bypass CLI wrapper

Hi all,

I’ve built a small CLI wrapper around Claude Code that allows you to bypass all the usual restrictions and run it in environments that normally wouldn’t allow it — like under root, without Docker, or offline.

Main features:

* Always enables --dangerously-skip-permissions
* Fakes getIsDocker() and hasInternetAccess() responses
* Works fine under root
* Can run in headless/server environments
* Simple alias (cl) for quick usage

I know it’s a simple workaround, but I couldn’t find a working solution anywhere, so I figured I’d just make one and share it.

Still rough around the edges, but works well in practice.

GitHub repo: [https://github.com/gagarinyury/claude-code-root-runner](https://github.com/gagarinyury/claude-code-root-runner)

Would love feedback or ideas if you have any.

https://redd.it/1l7t7xp
@r_devops
Finally solved GNOME's annoying multi-monitor workspace problem (for me)

Been dealing with this for months on my 3-monitor setup. GNOME's workspace switching moves ALL monitors together, so when I switch contexts on my external displays, I lose my communication apps on the laptop screen. Drives me nuts.

Tried a bunch of existing extensions but nothing worked right. So I built my own.

**The fix:** Extension tracks which monitor your mouse is on. When you switch workspaces, only that monitor gets new content. The other monitors' windows automatically shift to keep everything in sync.

Example: I swipe left on my code monitor. My browser and terminal shift left too, but stay visible on their respective screens. No more losing Slack when I'm debugging.

**How it works:** Instead of blocking GNOME's workspace system (which breaks things), it works WITH it. Lets GNOME do the workspace change normally, then quickly moves windows around to maintain the illusion of per-monitor independence.

**Gotchas:**

* Requires static workspaces (not dynamic)
* Brief window animation when switching - it's not native behavior
* Your windows are technically moving between workspaces constantly, but you don't really notice

Took way longer than expected because GNOME really wasn't designed for this. Had to try 3 different approaches before finding one that didn't crash the shell.

Code's on GitHub if anyone wants to try it or improve it: [https://github.com/devops-dude-dinodam/smart-workspace-manager](https://github.com/devops-dude-dinodam/smart-workspace-manager)

Works great for my workflow now. Laptop stays on comms, externals switch contexts independently. Finally feels like macOS did this right and Linux caught up.

Anyone else solved this differently? Always interested in other approaches.

https://redd.it/1l7uks8
@r_devops
What's your role like? What are your responsibilities?

I'm the only senior devops person (edit: also the only devops person in the company; there's no junior or mid, just me) in a small/medium company (10 devs, 60 employees total). The developers know "some" things, just enough to apply some changes and create new resources in Terraform, but I'm responsible for the following:

- Azure (the whole tenant: security, Kubernetes, VMs, VNets, VPNs, etc., including AI provisioning and Fabric, for example)
- AKS clusters (k8s)
- On-prem servers running Kubernetes
- Terraform creation and management for all the projects
- CI/CD
- General security knowledge and implementation
- General automations
- Backups
- Developer help with setups and configurations (including when they have Linux issues)
- Of course, help with restoring when services are down (whole AKS, or RabbitMQ, or nginx, etc.)
- (basically everything that is not development of the services)

Sometimes I feel burnt out with all the context switching and different responsibilities. Sometimes I just slack off because I don't really have focus on, or mastery of, any one topic.

I have almost 15 years of experience in IT (development and ops), but I only switched to a pure devops job 3 years ago, so I don't really have a frame of reference with other devops colleagues and jobs to tell whether this is a normal set of responsibilities and I'm just not putting in enough effort, or whether it's genuinely too much.

What does the average devops role cover, and is this too much?

https://redd.it/1l7wte0
@r_devops
The "works on my machine" curse that nearly killed my DevOps career

Anyone else been here? You write bash scripts that run flawlessly on your laptop, then deploy to production and watch everything burn down?

Ten years ago I was THAT developer. The one everyone avoided during deployments. My scripts assumed Ubuntu paths, hardcoded my username, and had zero error handling.

The breaking point: Friday 3:47 PM deployment that took our main site down for 3 hours. My manager literally asked if I was "ready for production work."

That hurt. But it motivated me to figure out how to write bash that actually works in the real world.

Spent the last few years documenting the transformation from "works on my machine" developer to the person they call when production is on fire. Covered everything from environment validation to career impact.

Key lessons:

Your dev environment is lying to you
Error handling is the difference between junior and senior
Good bash scripts = professional credibility
Production-ready code got me promoted twice

Biggest realization: It's not just about the technical skills. It's about becoming someone your team trusts with critical systems.

So, I wrote up the full journey with before/after code examples, the specific mindset shifts that mattered, and how it changed my career trajectory. The article is here: https://medium.com/@heinancabouly/from-it-works-on-my-machine-to-production-hero-a-bash-journey-186e087d97bf?source=friends_link&sk=89f9f53e1b21065d94631d24b04710f1

https://redd.it/1l7ym6l
@r_devops
Has anyone heard the term “multi-dimensional optimization” in Kubernetes? What does it mean to you?

Hey everyone,
I’ve been seeing the phrase “multi-dimensional optimization” pop up in some Kubernetes discussions and wanted to ask - is this a term you're familiar with? If so, how do you interpret it in the context of Kubernetes? Is that a more general approach to K8s optimization (that just means that you optimize several aspects of your environment concurrently), or does that relate to some specific aspect?

https://redd.it/1l7zqz2
@r_devops
Any efficient ways to cut noise in observability data?

Hey folks,

Does anyone have solid strategies or solutions for cutting down observability data noise, especially in logs? We're getting swamped with low-signal logs, especially from the info/debug levels. It's making it hard to spot real issues and inflating storage costs.

We’ve tried some basic and cautious filtering (in order not to risk missing key events) and asking devs to log less, but the noise keeps creeping back.
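One pattern that tends to survive the "noise creeps back" problem is enforcing sampling in the pipeline instead of relying on devs to log less: keep everything at WARNING and above, and sample INFO/DEBUG at a fixed rate. A minimal sketch with the Python stdlib `logging` module (the 1% rate is an arbitrary placeholder; real pipelines often do this in the collector, e.g. an OpenTelemetry or Fluent Bit filter, rather than in-process):

```python
# Level-aware log sampling: warnings and errors always pass, lower
# levels are kept at a configurable sample rate.
import logging
import random

class LevelSampler(logging.Filter):
    """Pass everything at WARNING or above; sample lower levels."""

    def __init__(self, sample_rate=0.01, rng=random.random):
        super().__init__()
        self.sample_rate = sample_rate
        self.rng = rng  # injectable for deterministic testing

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        # Keep roughly sample_rate of INFO/DEBUG records
        return self.rng() < self.sample_rate
```

Attach it with `logger.addFilter(LevelSampler(0.01))`. The nice property is that the signal (WARN+) is never dropped, so cautious filtering fears about missing key events mostly go away.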

Has anything worked for you?

Would love to hear what helped your team stay sane. Bonus points for horror stories or “aha” moments lol.

Thanks!

https://redd.it/1l811zt
@r_devops
Are you using Dev Containers?

I was wondering about these today. I have been using them on and off for a few years now for personal stuff, and they work pretty well. Integration with VScode is pretty good too, as a Microsoft backed spec, but I have had some stuff break on me in VScodium.

I was wondering if they have genuine widespread adoption, especially in professional settings, or if they are somewhat relegated to obscurity. The spec has ~4,000 GitHub stars, which is a lot, but not as much as I would expect for something that could be relevant to every dev, especially if you are bought into the Microsoft development stack (Azure DevOps, GitHub, Visual Studio, etc.).

So do you guys use these? I am always going back and forth on just rolling my own containers, but some of the built-in stuff in VS Code is great for spinning these up quickly. I would be interested to hear what other people do.

https://redd.it/1l825eo
@r_devops
Which small cybersecurity company deserves way more attention?

Hey everyone,
I'm curious to hear your thoughts — which lesser-known or small cybersecurity companies do you think are really underrated or deserve way more attention than they’re getting?

I’m not talking about the big names like CrowdStrike, Palo Alto, or SentinelOne, but rather smaller, niche players doing innovative or impactful work. Whether it’s a company with a cool product, a solid team, or just a fresh approach to solving real security challenges — I’d love to learn more.

Looking forward to your recommendations!

https://redd.it/1l82rgf
@r_devops
Why Are GitOps Tools So Popular When Helmfile + GitHub Actions Are Simpler?

I’ve been working with Kubernetes for about 8 years, and I’ve used Helmfile in production enough to feel comfortable with it. It’s simple, declarative, and works well with GitHub Actions or any CI system. It’s easy to reason about, and in many cases, it just works.

I’ve also prototyped ArgoCD and Flux, and honestly… I don’t get the appeal.

From my perspective:

* GitOps tools introduce a lot of complexity: CRDs, controllers, syncing logic, and additional moving parts that can be hard to debug.
* Debugging issues in GitOps setups can be non-intuitive, especially when something silently drifts or fails to sync.
* Helmfile + CI/CD is transparent and flexible: you know exactly what’s being applied and when.

What’s even more confusing is that I often see teams using CI tools alongside GitOps, not because they want to, but because they have to. For example:

* GitOps tools don’t handle templating or secrets management directly, so you end up needing tools like External Secrets, which isn’t always appropriate.
* It’s also surprisingly difficult to pass output values from your IaC tool (like Terraform or Pulumi) into your cluster via GitOps. Tools like Crossplane try to bridge that gap, but in practice, it often feels convoluted and heavy for what should be a simple handoff.

And while I’ll admit the ArgoCD dashboard is nice, you can get a similar experience using something like Headlamp, which doesn’t even require installing anything in your cluster.

Another thing I don’t quite get is the strong preference for pull-based over push-based workflows. People say pull is “more secure” or “more GitOps-y,” but:

* It’s not difficult to keep cluster credentials safe in a push-based system.
* You often end up triggering syncs manually or via CI anyway.
* Push-based workflows are simpler to reason about and easier to integrate with IaC tools.

Yet GitOps seems to be the default recommendation everywhere: Reddit, blogs, conference talks, etc. It feels like the popularity is driven more by:

1. Vendor marketing: GitOps tools are often backed by companies with strong incentives to push them. Think Akuity (ArgoCD), Codefresh, Control Plane, and previously Weaveworks (Flux).
2. Social momentum: Once a few big players adopt something, it becomes the “best practice.”
3. Buzzword appeal: “GitOps” sounds cool and modern, even if the underlying mechanics aren’t new.

Curious to hear from others:

* Have you used both GitOps tools and simpler CI/CD setups?
* What made you choose one over the other?
* Do you think GitOps is overhyped, or am I missing something?

https://redd.it/1l85yu8
@r_devops
Should I add links to public GitHub repos I've contributed to on my resume?

Been sprucing up the ol' resume as I'm not too thrilled where things are going at my current job. It's a shame too, as I love working with the team I have.

Currently, I am employed at a GCP-centric consulting company. We are partnered with Google Cloud and have done many projects for them. Over the course of the last two years I had a big hand in 2 major projects, which were eventually published by Google and now sit in their official repositories. Of the two, I authored one myself along with a data engineer, while for the other I was part of a smaller team in which I and two other engineers were responsible mainly for the infrastructure (all Terraform).

To me, this is a big milestone in my career. Obviously I would like to point it out on my resume, but I'm a bit conflicted as to whether to add links to these repositories somewhere on it. I'm unsure if 1) the AI or algorithm HR uses will flag links on my resume and weed it out, and 2) if it does pass, whether managers will even bother looking at them.

https://redd.it/1l81w0r
@r_devops
CNCF, Your Certification Exams Are a Privileged, Ableist Joke — And I'm Done Pretending Otherwise

I’m sick of it.

These so-called "industry standard" Kubernetes certifications (CKA, CKAD, CKS) have become a monument to privilege, not merit. You want to prove your skills in Kubernetes? Cool. But apparently, first you need to prove you own a luxury apartment, live alone in a soundproof bunker, and don’t blink too much.

Let me break this down for the CNCF and their sanctimonious proctors:

Not everyone has a dedicated home office.

Not everyone can afford to book a quiet coworking space or even a hotel for a whole night just to take your absurdly strict exam.

Not everyone lives in a country where stable internet is guaranteed, or where the "exam spyware" even runs properly.

And some of us are disabled, neurodivergent, or otherwise unable to sit still and silent in front of a single screen while being eyeball-tracked by an AI that treats a sneeze like a felony.


You know what happens when I try to take the exam from my living room — which, by the way, is also my office, bedroom, and kitchen?

I get flagged because someone walked past the door.

I get banned for “looking away” to stretch my neck.

I get stressed out to hell before the exam even starts, just trying to pass the ridiculous room scan.

And then if the proctor’s software crashes, guess what? No refund. No re-entry. No second chance. Just another $395 down the drain.


Oh, and let’s talk about ableism, shall we?

People with ADHD, autism, mobility constraints, chronic pain — you’ve built a system that excludes them by default. Can’t sit still? Can’t control your eye movement? Can’t guarantee your kid won’t cry in the next room?

Too bad. No cert for you. Try again with a different life.

This isn’t “security.” It’s elitism wrapped in bureaucracy.
You know who passes these exams easily? People in tech hubs, with quiet apartments, corporate backing, expensive equipment, and no roommates.
You know who gets flagged, banned, or priced out? Everyone else.

So here’s a wild idea:
Make it fair. Make it accessible. Make it human.

Offer test centers.
Offer accommodations.
Stop treating remote exam-takers like criminals.
And while you’re at it, stop pretending like this system represents “the future of cloud.”

It represents the past, just with more invasive surveillance.

Signed,
One very pissed-off cloud engineer
Who doesn’t need your cert to prove it
But wanted the badge anyway, before you made it a gatekeeping farce

https://redd.it/1l88uej
@r_devops
Monitoring showed green. Users were getting 502s. Turns out it was none of the usual suspects.

Ran into this with a client recently.

They were seeing random 502s and 503s. Totally unpredictable.
Code was clean. No memory leaks. CPU wasn’t spiking.
They were using Watchdog for monitoring and everything looked normal.

So the devs were getting blamed.

I dug into it and noticed memory usage was peaking during high-traffic periods.
But it would drop quickly: the spikes lasted just long enough to cause issues, but were short enough to disappear before anyone saw them.

Turns out Watchdog was only sampling every 5 mins (and even slower for longer time ranges).
So none of the spikes were ever caught. Everything looked smooth on the graphs.
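The sampling effect is easy to demonstrate: a spike shorter than the scrape interval can land entirely between two samples and never appear on a graph. A toy simulation (all numbers made up for illustration):

```python
# Toy demo: coarse sampling misses a 60-second memory spike entirely.

def sample(series, interval):
    """Take every `interval`-th point of a per-second series."""
    return series[::interval]

if __name__ == "__main__":
    series = [40] * 600             # 10 minutes at 40% memory, per second
    series[200:260] = [98] * 60     # a 60s spike to 98% starting at t=200s

    coarse = sample(series, 300)    # one sample every 5 minutes
    fine = sample(series, 15)       # one sample every 15 seconds

    print("5-min samples see max:", max(coarse))   # 40 -- spike missed
    print("15-sec samples see max:", max(fine))    # 98 -- spike caught
```

Same data, two very different graphs; the 5-minute view is "smooth" purely because neither sample lands inside the spike.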

We swapped it out for Prometheus + Node Exporter and let it collect for a few hours.
There it was: full memory saturation during peak times.

We set up autoscaling based on memory utilization to handle peak traffic demands.
Errors gone. Devs finally off the hook.

Lesson: when your monitoring doesn’t show the pain, it’s not the code. It’s the visibility.

Anyway, just thought I’d share in case anyone’s been hit with mystery 5xxs and no clear root cause.

If you’re dealing with anything similar, I wrote up a quick checklist we used to debug this. DM me if you want a copy.

Also curious have you ever chased a bug and it ended up being something completely different than what everyone thought?

Would love to read your war stories.


https://redd.it/1l86ynq
@r_devops