Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Cert expired (again). Built a tool to stop the madness. Curious what DevOps folks think

You know that moment when everything breaks on a Sunday morning because someone forgot to renew a TLS cert?

Yeah. Me too. Too many times.

So I built **a tool** (I'm not posting the link here because I don't want to spam; I'm looking for feedback): a certificate monitoring and management tool built for *real-world* DevOps setups.

It handles:

* Public domains, keystores, cert folders
* Internal mTLS certs, air-gapped systems, embedded devices
* Azure Key Vault, HashiCorp Vault, and more coming soon
* Offline-friendly agent (keymon — [npm link](https://www.npmjs.com/package/keymon))
* Expiry alerts, tagging, environment grouping, ownership context

Basically: stop the tribal knowledge, spreadsheets, and “who owns this cert?” fire drills.

Curious how the DevOps crowd is managing internal certs these days: scripts? Prometheus exporters? Or just hoping Let’s Encrypt doesn’t let you down?

Would love feedback. If you want to give it a spin, let me know and we can chat "offline". Or just roast it if you hate certs as much as I do 😂
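
For anyone in the "scripts" camp, a minimal sketch of an expiry check in Python, using only the standard library, could look like this. The 30-day threshold and the function names are assumptions for illustration; this is not how the tool described above works.

```python
import socket
import ssl

WARN_DAYS = 30  # assumed alert threshold, tune to your renewal process


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a cert's notAfter string."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served at host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(not_after: str, now: float, warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate expires within `warn_days` (or already has)."""
    return days_until_expiry(not_after, now) <= warn_days
```

A scheduled job could call `needs_renewal(fetch_not_after("example.com"), time.time())` per host. Note that the default context verifies the chain, so the handshake fails outright for an already expired cert or one signed by a private CA; internal certs would need that CA loaded into the context first.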

https://redd.it/1mfrayy
@r_devops
Microservices over monoliths

I know that microservices are not for everyone, especially if you are just starting out, but can someone briefly explain why a company would switch to a microservices architecture? What happens that makes a monolith no longer the right option?

https://redd.it/1mfryn8
@r_devops
Why Observability Isn’t Just for SREs (and How Devs Can Get Started)

Almost every other day, when I scroll past r/devops or r/sre, I see a post like this asking how a dev can get started with devops, observability, etc.

I've written a blog post to help anyone who is lost find their way into observability, and as a wake-up call for devs: they should think about observability more actively today than ever before!

A dev’s observability playbook.

Here's the link.

https://redd.it/1mfsvq8
@r_devops
Best path to learn DevOps fast with structure

Hi everyone 👋

I work a full-time 9-to-5 and I want to become a DevOps specialist as fast as possible. My goal is to build strong foundations quickly and then start working on my own projects, find a DevOps job, or start taking small freelance/consulting DevOps gigs.

I am trying to choose between three options:

1. TechWorld with Nana bootcamp: very visual and structured, but a bit expensive and, according to feedback, not always in depth?
2. Cloud Engineer Academy with Suleymane: focused and looks serious, but I do not know much about the results?
3. KodeKloud: very hands-on, but harder to stay focused or follow a single clear path, since it is pick-and-choose with no real build-up between sections?

I personally feel that when you are busy with a full-time job, it is better to follow one structured course instead of jumping between free resources or YouTube. Otherwise it gets too messy and I lose time or motivation.

What would you recommend if you were in my shoes?
Ideally I want to build real-world DevOps skills and be able to work as a consultant or freelancer in 8 months (if that is even possible :D)

If you have experience with any of these or took a different fast track that worked, I would love to hear about it. Thanks a lot!

https://redd.it/1mfsr79
@r_devops
DevOps role at an AI startup or full-stack agent role at an agentic company?

Hi guys,

I am a new grad with full-stack development experience at a medium-sized company, and I am now looking for full-time roles. I am torn between the two options, so please help me out. I am super interested in and passionate about getting into distributed systems, but the AI revolution is giving me FOMO about learning and building AI agents. What do you all think? Which should I choose?

https://redd.it/1mfs9qt
@r_devops
Deployment versioning problems?

Does anyone else have trouble keeping track of the many versions of different components deployed to different customers?


Does anyone else's company have 5+ helm charts (each versioned and released separately), distinct "appVersions" that are also versioned and released separately, along with other components (e.g. infrastructure) that have separate versions/release schedules? On top of all of that, each customer may be on a different set of versions of each of these things.


If so, how do you keep track of all of them? Full disclosure: I'm considering building a web app that helps track and visualize all of these versions and release schedules, because the standard project management tools don't quite lay out the visualization the way I want. I'd like to see each component on a timeline of sorts that shows what version each component is at and which version a particular customer is on. Do you all know of any existing tools that excel at displaying/tracking this info?
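
As a starting point for the tracking problem described here, a minimal sketch of the underlying data model in Python: each customer maps to the version of each component they run, and a helper reports who is behind the latest release. All customer names, component names, and version numbers are made up for illustration.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def customers_behind(
    deployed: Dict[str, Dict[str, str]], latest: Dict[str, str]
) -> List[Tuple[str, str, str, str]]:
    """List (customer, component, deployed, latest) for every outdated component."""
    behind = []
    for customer, components in sorted(deployed.items()):
        for component, version in sorted(components.items()):
            newest = latest.get(component)
            if newest and parse_version(version) < parse_version(newest):
                behind.append((customer, component, version, newest))
    return behind


# Hypothetical data: two customers, one Helm chart and one app version each.
latest = {"api-chart": "2.1.0", "app": "5.10.0"}
deployed = {
    "acme": {"api-chart": "2.1.0", "app": "5.9.3"},
    "globex": {"api-chart": "1.9.4", "app": "5.10.0"},
}

for row in customers_behind(deployed, latest):
    print("%s: %s %s -> %s" % row)
```

The parser only handles plain dotted numbers; pre-release suffixes such as `-rc1` would need a real semver library. A timeline view would add a release date per version on top of the same structure.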

https://redd.it/1mfy3o0
@r_devops
SchemaNest - Where schemas grow, thrive, and scale with your team.

Lightweight. Team-friendly. CI/CD-ready.

🚀 A blazing-fast registry for your JSON Schemas
Versioning & search via web UI or CLI
Fine-grained auth & API keys
Built-in PostgreSQL & SQLite support
Written in Go & Next.js for performance & simplicity
Built-in setup instructions for editors, IDEs, and more

🛠️ Drop it into your pipeline. Focus on shipping, not schema sprawl.
🔗 github.com/timo-reymann/SchemaNest

Questions / feedback?
You are welcome to post a comment here with suggestions or feedback. For bug reports and feature requests, feel free to create issues/PRs!

https://redd.it/1mg1fl8
@r_devops
Which Infrastructure as Code tool has crushed your soul the most… and which one has saved you?

I'm building a visual platform (like Figma, but for infra) and I'm studying the real pain points of those of us who work with Terraform, Pulumi, Ansible, or CloudFormation.

My personal experience:

Terraform: powerful, but remote state management is a bomb if you touch it the wrong way

Pulumi: nice in theory, but I've seen the SDK stop working from one day to the next

Ansible: I like it, but when playbooks get nested too deeply, it becomes hellish

CloudFormation: honestly, I don't understand why AWS keeps pushing it so hard


I'm not here to sell anything or run marketing surveys. I just want to know:

🔹 What has worked for you long-term on real teams? 🔹 Which tool would you replace tomorrow if you could?

Ranting, crying, and philosophizing are all welcome. I'm reading everything.

https://redd.it/1mg8p3n
@r_devops
Any way to make AWS + Cloudflare setup less painful? I'm burning out

Trying to spin up infra for a project and forgot how much overhead there is.

Setting up IAM, VPCs, EC2 roles, DNS, SSL certs, Cloudflare config… it’s just a mess. Even getting basic stuff working securely feels like a part-time job.

I’m not trying to over-engineer this, I just want to deploy to AWS and not worry about blowing up my weekend fixing config errors.

Anyone here using something that actually makes this easier?

https://redd.it/1mgavk3
@r_devops
Tired of K8s

I think I am not the only one who is tired of this monstrosity. Long story short, at some point maintaining K8s and all the language it carries becomes as expensive as reworking the whole structure and switching to a custom orchestrator tailored for the task. I wish I had done it right from the start!

It took 4 devs and 3 months of work to cut costs to 40% and workload to 80%, and the result is a lot easier to maintain! God, why do people jump into this pile of plugins and services without thinking twice about the consequences?

https://redd.it/1mgc01h
@r_devops
Want to transition from full stack dev to devops

Basically the title. I have 3 years of experience in full stack development. Now I want to transition to DevOps. I was preparing for AZ 200 but now I don't want to sit for that exam anymore. I'd rather prepare for AZ 400. I don't have hands-on experience with things like Terraform, Ansible, Kubernetes, etc. I can't see any well-defined path ahead of me. What should I do and how can I get noticed by recruiters?

https://redd.it/1mggsn8
@r_devops
Scaling down to 0 during non-business hours

Hey everyone,


I just wanted to ask if your team scales down to 0 during off hours?

How do you do it? Cron, KEDA, …

What scope are you responsible for? E.g. the whole test cluster, just some namespaces

What flavor of Kubernetes are you using? I would be particularly interested in ARO (Azure Red Hat OpenShift)

Is it common practice to remove nodes as well during off hours?

What were your pain points?

Did you notice any significant cost savings?
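
On the "Cron" option, the scheduling logic itself is small. A minimal Python sketch, assuming business hours of 07:00 to 19:00 on weekdays (an assumption, adjust to your team) and a hypothetical `desired_replicas` helper whose result a scheduled job could pass to `kubectl scale --replicas=<n>`:

```python
from datetime import datetime, time

BUSINESS_START = time(7, 0)   # assumed start of business hours
BUSINESS_END = time(19, 0)    # assumed end of business hours


def in_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between BUSINESS_START and BUSINESS_END."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START <= now.time() < BUSINESS_END


def desired_replicas(now: datetime, normal_replicas: int) -> int:
    """Replica count to apply: the normal count in hours, zero otherwise."""
    return normal_replicas if in_business_hours(now) else 0
```

The function takes a naive `datetime`, so the caller should pass the time in the team's local timezone. Public holidays are not handled.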


Thx!

https://redd.it/1mggmzk
@r_devops
Anyone here working at big tech companies like Apple, Google, NVIDIA, or Netflix?

If anyone here works at one of these companies, could you reach out? I need guidance on how to crack one of them.

https://redd.it/1mgiiaa
@r_devops
How we solved environment variable chaos for 40+ microservices on ECS/Lambda/Batch with AWS Parameter Store

Hey everyone,

I wanted to share a solution to a problem that was causing us major headaches: managing environment variables across a system of over 40 microservices.

The Problem: Our services run on a mix of AWS ECS, Lambda, and Batch. Many environment variables, including secrets like DB connection strings and API keys, were hardcoded in config files and versioned in git. This was a huge security risk. Operationally, if a key used by 15 services changed, we had to manually redeploy all 15 services. It was slow and error-prone.

The Solution: Centralize with AWS Parameter Store. We decided to centralize all our configurations. We compared AWS Parameter Store and Secrets Manager. For our use case, Parameter Store was the clear winner. The standard tier is essentially free for our needs (10,000 parameters and free API calls), whereas Secrets Manager has a per-secret, per-month cost.

How it Works:

1. Store Everything in Parameter Store: We created parameters like /SENTRY/DSN/API_COMPA_COMPILA and stored the actual DSN value there as a SecureString.
2. Update Service Config: Instead of the actual value, our services' environment variables now just hold the path to the parameter in Parameter Store.
3. Fetch at Startup: At application startup, a small service written in Go uses the AWS SDK to fetch all the required parameters from Parameter Store. A crucial detail: the service's IAM role needs kms:Decrypt permissions to read the SecureString values.
4. Inject into the App: The fetched values are then used to configure the application instance.
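
The four steps above can be sketched as follows. The original service is written in Go; this sketch uses Python instead, and the function names, the `SENTRY_DSN` variable name, and the idea of passing the fetcher in as a callable are assumptions for illustration. `fetch_from_ssm` relies on the third-party `boto3` package and its `get_parameters` call, which accepts at most 10 names per request.

```python
from typing import Callable, Dict, List

Fetcher = Callable[[List[str]], Dict[str, str]]
BATCH_SIZE = 10  # GetParameters accepts at most 10 names per call


def fetch_from_ssm(paths: List[str]) -> Dict[str, str]:
    """Fetch and decrypt parameters by path (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the rest runs without it

    response = boto3.client("ssm").get_parameters(Names=paths, WithDecryption=True)
    return {p["Name"]: p["Value"] for p in response["Parameters"]}


def resolve_config(env: Dict[str, str], fetch: Fetcher) -> Dict[str, str]:
    """Replace each env value (a parameter path) with the stored value."""
    paths = sorted(set(env.values()))
    values: Dict[str, str] = {}
    for start in range(0, len(paths), BATCH_SIZE):
        values.update(fetch(paths[start:start + BATCH_SIZE]))
    missing = [path for path in paths if path not in values]
    if missing:
        raise KeyError("parameters not found: %s" % ", ".join(missing))
    return {name: values[path] for name, path in env.items()}
```

At startup this might be called as `resolve_config({"SENTRY_DSN": "/SENTRY/DSN/API_COMPA_COMPILA"}, fetch_from_ssm)`. Failing on missing parameters keeps a typo in a path from turning into an empty setting at runtime.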

The Wins:

* Security: No more secrets in our codebase. Access is now controlled entirely by IAM.
* Operability: To update a shared API key, we now change it in one place. No redeployments are needed (we have a mechanism to refresh the values, which I'll cover in a future post).

I wrote a full, detailed article with Go code examples and screenshots of the setup. If you're interested in the deep dive, you can read it here: https://compacompila.com/posts/centralyzing-env-variables/

Happy to answer any questions or hear how you've solved similar challenges!

https://redd.it/1mgl9tl
@r_devops
Terraform Associate (003) Exam – Sharing Study Resources That Helped Me Pass

Hi all,

Just wanted to share some resources that helped me pass the HashiCorp Certified: Terraform Associate (003) exam for those who are going to be taking the exam soon. If you're working in DevOps and considering the certification, I hope this helps streamline your study journey.

# 🎥 Free Video Tutorials

* **SuperInnovaTech** – Terraform Associate 003 Exam Preparation - Provisioning a simple website on AWS with Terraform
* **FreeCodeCamp** – Full-length Terraform Associate Course (003)
* **Cloud Champ** – Practice exam question explanations
* **DevOps Directive** – Comprehensive Terraform fundamentals course

# 📘 Practice Exams (on Udemy)

I found practice exams on Udemy to be especially useful for reinforcing concepts and understanding how questions are framed in the real exam. I mainly used the following resource:

Udemy Terraform Practice Exams course by Muhammad Saad Sarwar (Three full practice exams - usually under 15 dollars with discount code)

# 🔗 Official Guide

* [HashiCorp Certification Overview & Study Guide](https://learn.hashicorp.com/terraform/certification/terraform-associate)

# 💻 Hands-on Practice

Beyond video content, spending time actually writing Terraform code was the most valuable prep. Try deploying resources in the AWS free tier, experimenting with modules, remote backends, and state management. Combine this with mock exams to solidify your understanding.

# 💡 Extra Tip

If you’re buying any courses on Udemy, try using monthly discount codes like `AUG25` or `AUG2025` — they often reduce the price to under $15.

If anyone else has tips or resources that worked well for them, feel free to share below. Good luck to everyone preparing — and keep automating! 🚀

https://redd.it/1mgm77r
@r_devops
Dev with 3.5 years experience - how should I start learning DevOps?

I’ve been a full stack developer for 3.5 years and want to start learning DevOps. I’ve never worked in a DevOps role, but I don’t want to fully switch to DevOps either. From what I’ve seen in the job market, a lot of roles expect these skills and I think they’ll help me when I take the next step in my career.

What’s the best way to start?

* Bootcamp, online courses or self study?
* Which tools should I learn first?
* Any good projects or certifications to aim for?

Looking for advice from people who have done both dev and DevOps.

https://redd.it/1mgn1lq
@r_devops
Why do apps behave differently across dev/QA/staging/prod environments? What causes these infrastructure issues?

We're deploying the exact same code across all our environments (dev/QA/staging/prod) but still seeing different behaviors and issues. Even with identical branches, we're getting inconsistencies that are driving us crazy.

Are we the only team dealing with this nightmare, or is this a common problem? If you've faced similar issues with identical codebases behaving differently across environments, what turned out to be the culprit? Looking to see if this is just us or if other teams are also pulling their hair out over this.

https://redd.it/1mgnni6
@r_devops
our infra was fine. the ai pipeline wasn’t — 3 silent crashes we kept missing

I’m not here to sell a platform. this is about the dumb ways our llm pipeline kept breaking prod while dashboards stayed green.

**scenario you probably know:**
ci passes. health checks ok. then the “ai service” ships and returns perfect nonsense. sometimes it just 500s on first real call. infra looks clean. oncall eats the blame.

after too many postmortems we named the failures. turns out they’re boring devops problems wearing ai costumes:

* **bootstrap ordering** — services fire before deps ready. empty vector index, schema race, migrator lag. nothing explodes, but the first llm call has no data.
* **deployment deadlock** — circular waits: retriever ⇄ db ⇄ migrator. it “starts” but never becomes useful. traffic hits a zombie.
* **pre-deploy collapse** — version skew / missing secret. first prompt hits a cold model path and face-plants.
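
One way to make the bootstrap-ordering case visible is a readiness gate that only reports ready once each dependency holds real data. A minimal Python sketch; the check names and values are hypothetical, and each check is a plain callable so it can wrap whatever client the stack actually uses:

```python
from typing import Callable, Dict, List

Check = Callable[[], bool]


def failed_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check; return the names that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a dependency that errors is not ready
        if not ok:
            failed.append(name)
    return failed


def is_ready(checks: Dict[str, Check]) -> bool:
    """Readiness means every dependency check passed, not just 'process is up'."""
    return not failed_checks(checks)


# Hypothetical state a service might inspect at startup.
vector_count = 0      # an empty index can still sit behind a green health check
schema_version = 7
expected_schema = 7

checks = {
    "vector_index_populated": lambda: vector_count > 0,
    "schema_migrated": lambda: schema_version == expected_schema,
}
print(failed_checks(checks))  # ['vector_index_populated']
```

Wiring `is_ready` into the readiness probe keeps traffic away until the index has data, instead of letting the first llm call discover the gap.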

we wrote a **problem map** to keep ourselves honest. it has 16 failure modes

[`github.com/onestardao/WFGY/tree/main/ProblemMap/README.md`](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md)

what helped in practice:

* treat **knowledge boundary** like a health check. can the model say “don’t know” on a canary prompt? if not, it will bluff in prod.
* log **ΔS** (semantic jump) on your eval set. when ΔS > 0.85, deploy should go yellow; it means answers are fluent but logic detached.
* add a **semantic tree** artifact to ci. not transcripts, just node-level intent + module used. makes incident review tractable.
* first request in prod must be a **canary trio**: empty-query, adversarial, and known-fact. fail fast if one lies.
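
The canary trio could be wired up roughly like this. A minimal Python sketch in which `ask` stands in for whatever function calls your model; the prompts, the refusal phrases, and the expected fact are all placeholder assumptions, and treating an empty query as something the model should decline is one interpretation of the check:

```python
from typing import Callable, List

REFUSAL_MARKERS = ("don't know", "do not know", "not sure")  # assumed phrasing


def admits_ignorance(answer: str) -> bool:
    """True if the answer contains one of the assumed refusal phrases."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def canary_failures(ask: Callable[[str], str]) -> List[str]:
    """Run the three canaries and return the names of those that failed."""
    failures = []
    # 1. empty query: the service should not invent an answer from nothing
    if not admits_ignorance(ask("")):
        failures.append("empty-query")
    # 2. adversarial: a question about something that does not exist
    if not admits_ignorance(ask("What is the serial number of our moon office?")):
        failures.append("adversarial")
    # 3. known fact: a placeholder for a fact your indexed data must return
    if "paris" not in ask("What is the capital of France?").lower():
        failures.append("known-fact")
    return failures
```

A deploy hook could then fail fast with something like `if canary_failures(ask): raise SystemExit(1)`. Substring matching is crude; it is only meant to show the shape of the check.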

if you don’t want another service, we kept the control layer as a **.txt file** that wraps prompts and adds these checks. no binaries. no network calls. mit. dumb on purpose. it also happened to steady the model.

i’m not asking you to switch stacks. if you’re running rag/agents/chat and seeing **green deploys + red outcomes**, skim the map and tell me which number smells like your incident. i’ll point to the exact fix without vendor links.

again, map link (only):
[`https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md`](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md)

curious what other silent failures folks have seen. especially first-call crashes that didn’t show up in staging. we’ll add them to the map if we’re missing a pattern.

https://redd.it/1mh3s57
@r_devops
Self-hosted API docs or third-party platforms? why choose one over the other?

Hey everyone,

I’m exploring options for publishing API documentation. Help me decide between self-hosting with tools like Docusaurus or Redoc and using third-party platforms like GitBook, ReadMe, or something else.

For those with experience:

* Why did you choose one over the other?
* What are the key trade-offs in terms of customization, cost, collaboration, and maintenance?
* Any regrets or strong recommendations?

https://redd.it/1mh623z
@r_devops
Will this help me in landing a DevOps role?

Hi. I'd appreciate it if anyone would take the time to give me some feedback. I have a year of experience as a software developer and network assistant (I was expected to do both roles at my job), plus another 2 years as a web developer.

I'm interested in knowing whether including a Next.js social media web app (a community/dating app) with thousands of active users, which I created and maintain, would be helpful if I were to apply for a DevOps role. Or would that not matter much in terms of getting the job, and should I focus on doing helpdesk or sysadmin jobs first to show experience?

https://redd.it/1mh7267
@r_devops