Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Java vs Python

What should I learn for DevOps: Java or Python?

I am really confused between these two languages.

Please help.

https://redd.it/1ln6gde
@r_devops
ISO 27001 Audit with a Self-Hosted Dashboard – Here’s the Behind-the-Scenes

Last week, I posted "How we left AWS, kept ISO 27001, and cut cloud costs by 90% (with Hetzner/OVH + Ansible stack)" and now I am back with a follow-up:

This self-hosted SaaS Passed Its ISO 27001 Audit: Here’s The Dashboard That Did It.

I built an internal dashboard to track every control, asset, risk, and audit trail, without paying for some overpriced compliance platform.

I wrote up the whole story (and included screenshots + methodology) here:

This self-hosted SaaS passed its ISO 27001 audit – here’s the dashboard that did it

If you’re bootstrapping, running open-source, or just hate “compliance theater”, this might be useful. Would love feedback, especially from others who’ve been through similar audits.

Note: ~80% of what I built is shared publicly across HN, Reddit comments, and the full breakdown on Medium (including screenshots + methodology). It’s an open build-in-public process that might help others skip overpriced compliance platforms.

I’m bootstrapping this and sharing the journey openly. There is an option to buy playbooks, but it is not needed to get value from my content. If that’s not the right vibe for this sub, I’ll take the feedback. No hard feelings.

https://redd.it/1ln990i
@r_devops
How do you deal with devs?

I was hired at a small company (about 50 IT employees) as a DevOps engineer. I’m the third DevOps engineer in the company, and our task is basically to clean up all our apps and implement best practices (IaC, CI/CD). We have a great ops team (i.e. sysadmins) that supports our vision, but our devs are not so fond of it.
We have a lot of manual deployments (git pull / docker compose up), no CI/CD, no orchestration, and are only now implementing VLANs.
When we suggest improvements, like setting up a Nexus proxy repo to start preparing for disconnecting from Docker Hub or npm, we are often ignored, and devs continue pulling packages directly from anywhere they want. When we suggest setting immutable Docker tags (not latest, of course), they oppose it because “it’s too hard to track which version to assign if there’s >1 dev working in 1 project”.
How do you deal with such situations? I’m not sure we can get support from the C-suite, since we are not a traditional IT company; we are more of a medtech company with a heavy focus on med, and are improving the tech side only because it started working too badly (we had 3-4 incidents per week about a year ago, when leadership decided we needed to invest in better infrastructure, observability, etc.).
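
On the immutable-tag objection above: one common way to avoid hand-assigning versions is to derive the tag from the commit being built, so two devs working in the same project cannot collide on a version number. A minimal sketch, assuming builds run from a Git checkout; the helper name `image_tag` and the registry path in the comment are hypothetical.

```shell
#!/bin/sh
# image_tag <branch> <full-commit-sha>
# Prints an immutable image tag of the form <branch>-<first 12 chars of sha>.
image_tag() {
    branch=$1
    sha=$2
    # Docker tags may contain only letters, digits, "_", "." and "-",
    # so replace every other character (e.g. "/") with "-".
    safe_branch=$(printf '%s' "$branch" | sed 's/[^A-Za-z0-9_.-]/-/g')
    # Shorten the SHA to 12 characters to keep the tag readable.
    short_sha=$(printf '%s' "$sha" | cut -c1-12)
    printf '%s-%s\n' "$safe_branch" "$short_sha"
}

# Example use inside a checkout (not executed here):
#   tag=$(image_tag "$(git rev-parse --abbrev-ref HEAD)" "$(git rev-parse HEAD)")
#   docker build -t "registry.example.internal/myapp:$tag" .
```

Because the tag is a pure function of the commit, nobody has to decide which version comes next, and the same commit always maps to the same tag.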

https://redd.it/1lna13j
@r_devops
How do you view the future?

I have seen opinions here and there about how DevOps as an idea will disappear soon, with services trying to replace it and automate it and whatnot. While I am not a DevOps engineer, I felt intrigued to ask and understand, as I always thought that DevOps was more of a company’s Frankenstein and not something for all.

And away from the AI drama, how do you view the future of DevOps? Will it transform? Is there a common channel for another role, like cloud engineer or SRE?

https://redd.it/1lnb78c
@r_devops
What is the most useful course you've taken?

For example, when I first was getting into the Cloud, I personally found Adrian Cantrill's course (for Solutions Architect Associate) really useful, both in the sense that it was teaching me about the Cloud, but also in the preliminary phase was teaching about tech in general, such as IPs (and how they're originally in octets), the OSI model, etc.

I'm a bit more advanced now. Some time ago I was studying for the CKA and I found KodeKloud's labs incredibly useful for understanding Kubernetes.

Besides courses, obviously we learn on the spot, we have to write research spikes, we create good documentation... but what have you guys found to be the 'gold standard', or not even gold standard, just an incredibly good or useful course in our field? (This can be the core of DevOps, or specializations, e.g. you were interested in SRE, so you decided to read Google's SRE book and then go through an XYZ course.)

https://redd.it/1lnc7y3
@r_devops
Looking for a DevOps job in Canada – any leads?

Hey folks,

I’m a DevOps engineer with 5+ years of experience (AWS, GitLab CI/CD, Docker, K8s, Terraform, etc.) currently looking for a new opportunity (open to hybrid, onsite, or remote).

If you know of any companies hiring or can refer me somewhere, I’d really appreciate it! Happy to DM my resume or chat more.

Thanks in advance! 🙏

https://redd.it/1lncn4v
@r_devops
AI risk is growing faster than your controls?

Hey guys, I'm the founder of a company called Jozu, which is a model integrity platform. I've been noticing a bit of a trend when talking with companies that are looking at adopting our solution and am curious how prevalent this is.

The TL;DR is that AI models aren't governed like first-class assets (e.g. application code).

First, your artifacts are scattered across Git, S3, HF Hub, MLflow, and Jupyter, and your models aren't consistently versioned. Second, it's unclear who signs off on what goes into production, and auditing changes for your customers or regulators is a nightmare.

This is caused by ad-hoc promotion scripts, dependence on tribal knowledge, unclear rollback versioning and processes, fragile change and lineage tracking, and manual auditing across multiple systems.

ML maturity varies so much from org to org that it's hard to know what is and isn't normal.

https://redd.it/1lncd78
@r_devops
Amazon DevOps Role Experience

I recently applied for a DevOps role at Amazon. I'm not looking to switch, but I'm targeting MAANG in the coming years, so I applied for this position to at least get experience of the hiring process. Surprisingly, my resume got shortlisted and I received an assessment link.
There were user experience, work style, and DevOps-related questions. I did well only in the last section, but fortunately I received a call from HR 4 days after the assessment. 🤞

She took all the basic details and asked me how good I am at coding. I showed my stupidity here by being brutally honest. I replied, "I mostly work on Kubernetes and the AI/ML part in my company, so in coding I would rate myself 6/10."

And here we go... Instant regret! 🥹
I never heard back from HR.
But now that my urge for these companies has already increased, I want to give it another shot after a few months.

I'm sharing this experience to learn how I can prepare myself and what skills I should develop to stand out from the crowd of experienced people.

Happy Learning!!!

https://redd.it/1lnez95
@r_devops
How do you handle trusted software delivery at a global scale?

Hey 👋
Right now I’m working on something pretty exciting (and a bit nerve-wracking, not gonna lie):

We have a global customer base, teams spread across Australia, the US, and Europe, and I need to build an infrastructure that ensures they can quickly and securely fetch container images from a registry that’s geographically close to them.

But speed isn’t enough.
I also need to guarantee that what they pull is exactly what I built, no tampering, no surprises, just trust.

So this isn’t just about performance; it’s about authenticity and integrity.
When a customer deploys my software, I want them to know:

1. It came from us
2. It hasn’t been touched
3. It’s the version they expected

Still brainstorming the best way to approach this (edge replication? verified signatures? something more elegant?), but would love to hear how others tackled similar challenges.

How do you handle trusted software delivery at a global scale?
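
For points 2 and 3 in the list above, one building block is to pin artifacts by content digest instead of by a mutable name, so that any replica in any region can be checked against the same expected value. A minimal sketch on a plain file, assuming GNU coreutils `sha256sum` is available; `verify_digest` is a hypothetical helper. For container images the analogous step would be pulling by `@sha256:` digest, and point 1 (provenance) would additionally need a signature check with whichever signing tool you adopt.

```shell
#!/bin/sh
# verify_digest <file> <expected-sha256-hex>
# Succeeds only if the file's SHA-256 digest equals the expected value.
verify_digest() {
    file=$1
    expected=$2
    # sha256sum prints "<hex>  <filename>"; keep only the hex field.
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Example use (hypothetical file name and variable):
#   verify_digest release.tar.gz "$EXPECTED_DIGEST" || echo "digest mismatch, refusing to deploy"
```

The expected digest has to reach the customer over a channel you trust (for example a signed release note), otherwise an attacker who can swap the artifact can swap the digest too.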

https://redd.it/1ln9tqb
@r_devops
>8YoE, majority of which at AWS Infra

So here's the thing. I quit from AWS after being abused at work. They keep contacting me to apply at their job postings. Of course, that's never going to happen.

I'm looking at the job market and almost all the postings are for seniors. I meet most of the 5+ years of experience requirements, though I don't match on AWS experience per se (I worked on internal infrastructure at AWS, not on the cloud side, which is not to say I didn't use S3, DynamoDB, IAM, CloudFormation, SNS/SQS).

I'm at the moment working on DSA after having learned a bit of Kubernetes, Terraform, Docker and OpenAPI3.

Planning to start system design on educative.io this week after wrapping up DSA (arrays, linked lists, sorting). Leaving out BFS, DFS, BST, hash maps, DP - is this a good idea?

I'll get more AWS hands-on experience with the labs I'll be doing on educative.io.

What do you folks recommend since I don't have experience with Kubernetes/EKS in production and, similarly, using the other tools such as Terraform, Jenkins, Ansible, GitHub Actions and Docker in production?

I'm aiming for a job after four and a half years of being unemployed.

https://redd.it/1lni4ug
@r_devops
Has platform engineering quietly become the “new backend”?

Lately I’ve noticed more companies shifting engineering responsibilities toward platform teams — managing infra, CI/CD, observability, even spinning up internal dev tools and platforms-as-a-product.

Meanwhile, traditional backend roles seem to be getting squeezed between frontend-heavy full-stack positions and infrastructure-heavy platform roles.

Is this just me, or are platform teams slowly absorbing more of what used to be backend territory?

Curious if others are seeing the same trend — and how backend devs or SREs are adapting.

https://redd.it/1lnjsxs
@r_devops
The company I work for has made an internal custom Jenkins

Ok, here’s the thing, I work for an IT consultancy here in Spain, and some of the executives had the idea to create a custom Jenkins setup where agents are installed on isolated client nodes (they only have outbound access to a Jenkins job endpoint).

The catch is that the agents send system info or info related to isolated apps to a Jenkins job URL, and Jenkins then tells them to run certain scripts based on rules and input data (for example, if an email with a specific subject arrives and a user is logged in, don’t kick them out).

The thing is, they don’t want to go public with this but I keep telling my boss it’s a great Jenkins mod.

Is this due to corporate strategy? Or just plain ignorance?


https://redd.it/1lnk2xz
@r_devops
Just graduated – Need project ideas for my resume

Hey! I just finished my engineering degree and I’m looking to build 1–2 solid projects to help land my first job.

I’m thinking of starting with a Website Uptime Monitor. Do you think it’s a good idea for showcasing skills? Any other project suggestions that would stand out to employers?
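
For a sense of scale, the core of an uptime monitor is small; the parts that showcase skills are the scheduling, alerting, storage, and deployment built around it. A minimal sketch of the core check, assuming `curl` is installed; `classify_status` and `check_url` are hypothetical names.

```shell
#!/bin/sh
# classify_status <http-code>: map an HTTP status code to UP or DOWN.
# curl reports 000 when no response arrived at all (DNS failure, timeout,
# refused connection), which falls through to DOWN.
classify_status() {
    case $1 in
        2??|3??) echo UP ;;
        *)       echo DOWN ;;
    esac
}

# check_url <url>: fetch the URL once and print "<url> <code> <UP|DOWN>".
check_url() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
    printf '%s %s %s\n' "$1" "$code" "$(classify_status "$code")"
}

# Example (needs network access, so it is left commented out):
#   check_url https://example.com
```

Running `check_url` from a scheduler and pushing the result to a metrics store or alert channel is where the CI/CD and infrastructure side of the project comes in.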

Thanks!

https://redd.it/1lnkzav
@r_devops
App Support

Hello, I am building a new app. I am a product person and I have a software engineer supporting me. He is mostly familiar with AWS, but I am open to any cloud-based platform. Could you please suggest a good stack for an app that is scalable but not massively costly at first (being a startup), ideally on AWS or any other cloud provider? Thanks

https://redd.it/1lnoaf4
@r_devops
Doing labs locally or on AWS?

Hi all,

I'm working on my DevOps skills: Git, CI/CD, Ansible, etc.

Do you use AWS, or do you do it locally on a VM?



https://redd.it/1lnptvi
@r_devops
PSA: Crossplane API version migrations can completely brick your cluster (and how I survived it)

Just spent 4 hours recovering from what started as an "innocent" Lambda Permission commit. Thought this might save someone else's Thursday.

What happened: Someone committed a Crossplane resource using `lambda.aws.upbound.io/v1beta1`, but our cluster expected v1beta2. The conversion webhook failed because the loggingConfig field format changed from a map to an array between versions.

The death spiral:

Error: conversion webhook failed: cannot convert from spoke version "v1beta1" to hub version "v1beta2":
value at field path loggingConfig must be []any, not "map[string]interface {}"

This error completely locked us out of ALL Lambda Function resources:

`kubectl get functions` → webhook error
`kubectl delete functions` → webhook error
Raw API calls → still blocked
ArgoCD stuck in permanent Unknown state

Standard troubleshooting that DIDN'T work:

Disabling validating webhooks
Hard refresh ArgoCD
Patching resources directly
Restarting provider pods

What finally worked (nuclear option):

```bash
# Delete the entire CRD - this removes ALL lambda functions
kubectl delete crd functions.lambda.aws.upbound.io --force --grace-period=0

# Wait for Crossplane to recreate the CRD
kubectl get pods -n crossplane-system

# Update your manifests to v1beta2 and fix loggingConfig format:
# OLD: loggingConfig: { applicationLogLevel: INFO }
# NEW: loggingConfig: { applicationLogLevel: INFO }

# Then sync everything back
```

Key lesson: When Crossplane conversion webhooks fail, they can create a catch-22 where you can't access resources to fix them, but you can't fix them without accessing them. Sometimes nuking the CRD is the only way out.
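
One way to catch this class of commit before it reaches the cluster is a CI step that fails when any manifest still declares the old API version. A rough sketch, assuming manifests are kept as YAML files in the repo and that `grep` supports `--include` (GNU grep does); `find_stale_api_versions` and the `manifests/` path are hypothetical. It only flags the apiVersion string and does not validate field formats such as loggingConfig.

```shell
#!/bin/sh
# find_stale_api_versions <dir> <group/version>
# Prints every YAML file under <dir> that declares the given apiVersion and
# returns non-zero if any were found, so it can fail a CI job.
find_stale_api_versions() {
    dir=$1
    stale=$2
    # -r: recurse, -l: print file names only, -F: match a fixed string.
    # "|| true" keeps a no-match result from aborting scripts run with set -e.
    matches=$(grep -rlF --include='*.yaml' --include='*.yml' \
        "apiVersion: $stale" "$dir" || true)
    if [ -n "$matches" ]; then
        printf '%s\n' "$matches"
        return 1
    fi
    return 0
}

# Example CI usage (hypothetical path):
#   find_stale_api_versions manifests/ lambda.aws.upbound.io/v1beta1
```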

Anyone else hit this webhook deadlock? What was your escape route?

Edit: For the full play-by-play of this disaster, I wrote it up here if you're into technical war stories.

https://redd.it/1lnor51
@r_devops
Can you cut observability bill by 50% with an eBPF-first stack?

Datadog costs. **A lot.**

Companies are paying more for telemetry than some production workloads. I’ve been researching how SaaS teams are quietly cutting 30–70% of their observability costs by replacing per-host agents with kernel-native tooling.

Companies like [EX.CO](https://EX.CO) and open-source adopters using [SigNoz](https://signoz.io/) are moving away from Datadog + CloudWatch and adopting **eBPF-first architectures** that are leaner, faster, and significantly cheaper.

# Stack shift

**Replace:**
• Datadog APM
• CloudWatch Logs
• CloudWatch Metrics

**With:**
• Cilium + Hubble (network flows)
• Pixie + Parca (profiling/traces)
• ClickHouse or Iceberg (raw storage)

**Result:**
• Zero sidecars
• < 1% CPU overhead
• Usage-based pipelines instead of per-host licenses

# Key takeaways

* eBPF probes run once per node → < 1 % CPU, zero sidecars
* Usage-based pipelines (ClickHouse / Iceberg) beat per-host licences
* Removing duplicate log streams saved another 40 % ingest

# 6-week roadmap & KPIs

1. **Deploy Cilium/Hubble** in a non-prod cluster; export to ClickHouse or S3. *Target: < 1 % node overhead*
2. **Enable eBPF profiling** (Pixie/Parca); compare to language agents. *Target: span parity*
3. **Shadow live traffic**; validate SLOs. *Target: < 2 % trace drop*
4. **Disable Datadog log ingest** for eBPF-covered namespaces. *Target: GB/day ↓ 40 %*
5. **Remove per-pod agents**; right-size node groups. *Target: CPU-hrs ↓*
6. **Pipe trimmed streams** to Iceberg / Redshift streaming for long-term ML/BI. *Target: $/GB storage ↓ 80 %*
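
The percentage targets in the roadmap above are simple before/after ratios, so they are easy to check automatically at each step. A small sketch using `awk`; the helper names and the numbers in the usage comment are hypothetical, not measurements from any real migration.

```shell
#!/bin/sh
# pct_drop <before> <after>: percentage reduction, printed with one decimal.
pct_drop() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# meets_target <before> <after> <target-percent>
# Succeeds if the reduction from <before> to <after> is at least the target.
# Compares (before - after) * 100 against target * before to avoid a division.
# Assumes <before> is greater than zero.
meets_target() {
    awk -v before="$1" -v after="$2" -v target="$3" \
        'BEGIN { exit !((before - after) * 100 >= target * before) }'
}

# Example with made-up ingest volumes in GB/day:
#   pct_drop 500 300                      # reduction for step 4
#   meets_target 500 300 40 && echo "log ingest target met"
```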

https://redd.it/1lnrr6i
@r_devops
what else?

RHCSA + K8s + AWS Cloud Practitioner & SysOps + Azure AZ-900 + Terraform + Ansible + Git + Docker.
What should I do next? I'm still a fresh graduate looking for a job. Any advice? What about remote work?

https://redd.it/1lnu65o
@r_devops
Octopus Deploy Reviews... What's your feedback?

I'm curious about Octopus Deploy in practical DevOps settings... It seems to have great ratings, especially for integration and support. While it gets praise for customizable steps and its UI, I’ve seen mentions of permissions headaches. If you've used it, what do you think: love it or hate it? How does it handle complex scaling? Any quirks I should know about? And with all the options out there, is it still worth using in 2025? Looking forward to this community's takes. I've gotten a ton of value as a lurker. Thanks in advance...

https://redd.it/1lnu62b
@r_devops
Ansible vs Terraform for idempotency?

This post assumes all of us are familiar with these two tools for infrastructure provisioning and configuration. This has been bugging me for a while. The shop I’m at has a hybrid cloud setup, and I’ve been using both of these tools and finding that Terraform is slowly becoming redundant. Both tools are sold on their idempotency for provisioning and configuration.

Terraform handles idempotency using state files with a persistent data store.

Ansible handles idempotency by “gathering facts” in memory and avoids any drift.

Pardon my ignorance, as this might have been asked from another angle in this sub. But why would I choose Terraform over Ansible for infrastructure provisioning at this point, with the hassle of handling persistent state files, when I can just do a dry run of Ansible to see the state of my infrastructure, all handled in memory?
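
One concrete thing the state file buys: when a resource is removed from the configuration, a state-based tool still has a record of it and can plan its destruction, whereas a playbook that no longer mentions a resource has nothing telling it that the resource exists. A toy sketch of that diff on plain text files (this is not real Terraform or Ansible; the file format and the name `orphaned_resources` are hypothetical):

```shell
#!/bin/sh
# orphaned_resources <desired-file> <state-file>
# Each file lists one resource name per line. Prints resources recorded in
# state but no longer declared, i.e. what a state-based tool would plan to
# destroy.
orphaned_resources() {
    desired_sorted=$(mktemp)
    state_sorted=$(mktemp)
    # comm requires sorted input.
    sort "$1" > "$desired_sorted"
    sort "$2" > "$state_sorted"
    # -1 drops lines unique to the desired file, -3 drops lines common to
    # both, leaving only lines unique to the state file.
    comm -13 "$desired_sorted" "$state_sorted"
    rm -f "$desired_sorted" "$state_sorted"
}

# Example (hypothetical files):
#   orphaned_resources desired.txt state.txt
```

Without the second input there is nothing to diff against: a dry run can only report on the resources the current playbook names.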

https://redd.it/1lnx00o
@r_devops
Cloud SIEM

Irrespective of the costs associated with the tools, why would you choose any other Cloud SIEM tool over Datadog's Cloud SIEM?

https://redd.it/1lnyuy8
@r_devops