Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Dev with 3.5 years experience - how should I start learning DevOps?

I’ve been a full stack developer for 3.5 years and want to start learning DevOps. I’ve never worked in a DevOps role, but I don’t want to fully switch to DevOps either. From what I’ve seen in the job market, a lot of roles expect these skills and I think they’ll help me when I take the next step in my career.

What’s the best way to start?

* Bootcamp, online courses or self study?
* Which tools should I learn first?
* Any good projects or certifications to aim for?

Looking for advice from people who have done both dev and DevOps.

https://redd.it/1mgn1lq
@r_devops
Why do apps behave differently across dev/QA/staging/prod environments? What causes these infrastructure issues?

We're deploying the exact same code across all our environments (dev/QA/staging/prod) but still seeing different behaviors and issues. Even with identical branches, we're getting inconsistencies that are driving us crazy.

Are we the only team dealing with this nightmare, or is this a common problem? If you've faced similar issues with identical codebases behaving differently across environments, what turned out to be the culprit? Looking to see if this is just us or if other teams are also pulling their hair out over this.

https://redd.it/1mgnni6
@r_devops
our infra was fine. the ai pipeline wasn’t — 3 silent crashes we kept missing

I’m not here to sell a platform. this is about the dumb ways our llm pipeline kept breaking prod while dashboards stayed green.

**scenario you probably know:**
ci passes. health checks ok. then the “ai service” ships and returns perfect nonsense. sometimes it just 500s on first real call. infra looks clean. oncall eats the blame.

after too many postmortems we named the failures. turns out they’re boring devops problems wearing ai costumes:

* **bootstrap ordering** — services fire before deps ready. empty vector index, schema race, migrator lag. nothing explodes, but the first llm call has no data.
* **deployment deadlock** — circular waits: retriever ⇄ db ⇄ migrator. it “starts” but never becomes useful. traffic hits a zombie.
* **pre-deploy collapse** — version skew / missing secret. first prompt hits a cold model path and face-plants.
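
for the bootstrap-ordering case, a minimal sketch of a readiness gate, assuming each dependency can be wrapped in a cheap boolean check. `wait_until_ready` and the check names are hypothetical, not something from our stack:

```python
import time
from typing import Callable, Dict, Optional


def wait_until_ready(
    checks: Dict[str, Callable[[], bool]],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll every dependency check until all pass or the timeout expires.

    Returns None when everything is ready, otherwise the name of the
    first dependency that was still failing at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Find the first check that still fails (None means all passed).
        failing = next((name for name, check in checks.items() if not check()), None)
        if failing is None:
            return None
        if time.monotonic() >= deadline:
            return failing
        sleep(interval_s)
```

the idea is to refuse traffic until this returns `None`, with checks such as "vector index has at least one document" or "latest migration applied", so the first llm call never sees an empty index.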

we wrote a **problem map** to keep ourselves honest. it has 16 failure modes

[`github.com/onestardao/WFGY/tree/main/ProblemMap/README.md`](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md)

what helped in practice:

* treat **knowledge boundary** like a health check. can the model say “don’t know” on a canary prompt? if not, it will bluff in prod.
* log **ΔS** (semantic jump) on your eval set. when ΔS > 0.85, deploy should go yellow; it means answers are fluent but logic detached.
* add a **semantic tree** artifact to ci. not transcripts, just node-level intent + module used. makes incident review tractable.
* first request in prod must be a **canary trio**: empty-query, adversarial, and known-fact. fail fast if one lies.
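
a rough sketch of the canary trio, assuming the model is reachable through a plain `ask(prompt) -> answer` callable. the refusal markers, the made-up product name, and the known-fact prompt are placeholders, swap in ones from your own domain:

```python
from typing import Callable, List

# Hypothetical marker phrases; tune them to how your model actually declines.
REFUSALS = ("don't know", "do not know", "not sure", "cannot answer")


def declines(answer: str) -> bool:
    """True when the answer is empty or admits it has nothing to say."""
    text = answer.strip().lower().replace("’", "'")
    return text == "" or any(marker in text for marker in REFUSALS)


def run_canary_trio(ask: Callable[[str], str]) -> List[str]:
    """Send the three canary prompts and return the names of failed checks."""
    failures = []
    # 1. empty query: the model should decline, not invent a question.
    if not declines(ask("")):
        failures.append("empty-query")
    # 2. adversarial: a question about something that does not exist.
    if not declines(ask("What is the uptime SLA of our zqx-9 quantum tier?")):
        failures.append("adversarial")
    # 3. known fact: an answer we can verify with a substring match.
    if "paris" not in ask("What is the capital of France?").lower():
        failures.append("known-fact")
    return failures
```

a non-empty result is the "fail fast" signal: block the rollout and report which of the three lied.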

if you don’t want another service, we kept the control layer as a **.txt file** that wraps prompts and adds these checks. no binaries. no network calls. mit. dumb on purpose. it also happened to steady the model:

>

i’m not asking you to switch stacks. if you’re running rag/agents/chat and seeing **green deploys + red outcomes**, skim the map and tell me which number smells like your incident. i’ll point to the exact fix without vendor links.

again, map link (only):
[`https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md`](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md)

curious what other silent failures folks have seen. especially first-call crashes that didn’t show up in staging. we’ll add them to the map if we’re missing a pattern.

https://redd.it/1mh3s57
@r_devops
Self-hosted API docs or third-party platforms? why choose one over the other?

Hey everyone,

I’m exploring options for publishing API documentation and need help deciding between self-hosting with tools like Docusaurus or Redoc and using third-party platforms like GitBook, ReadMe, or something else.

For those with experience:

* Why did you choose one over the other?
* What are the key trade-offs in terms of customization, cost, collaboration, and maintenance?
* Any regrets or strong recommendations?

https://redd.it/1mh623z
@r_devops
Will this help me in landing a DevOps role?

Hi. I'd appreciate it if anyone would take the time to give me some feedback. I have a year of experience as a software developer and network assistant (I was expected to do both roles at my job), plus another 2 years as a web developer.

I'm interested in knowing whether including a Next.js social media web app (a community/dating app) with thousands of active users, which I created and maintain, would help if I were to apply for a DevOps role. Or would it not matter much for getting the job, and should I focus on helpdesk or sysadmin jobs first to show experience?

https://redd.it/1mh7267
@r_devops
Looking for Freelance Opportunities - Kubernetes | DevOps | Platform Engineering

I hope you're doing well.

I’m a certified Kubernetes professional (CKA & CKS) with over 6 years of experience in platform engineering, DevOps, SRE, and system engineering. I've worked across multiple domains and tech stacks, helping teams build reliable, scalable, and secure infrastructure and platforms.

Currently, I have some availability and am open to taking on a few freelance projects, whether that’s Kubernetes setups, CI/CD pipelines, infrastructure automation, or cloud-native solutions.

If you know of any opportunities or are looking for someone to support your team on a short-term or project basis, I’d really appreciate it if you could reach out or refer me.

Thank you so much for your time and support! :-)

https://redd.it/1mh88x1
@r_devops
Is my Bitbucket pipeline YAML file good? Would love feedback!

Hey folks 👋

I'm working on a Bitbucket pipeline for a Node.js project and wanted to get some feedback on my current bitbucket-pipelines.yml file. It runs on pull requests and includes steps for installing dependencies, running ESLint and formatting checks, validating commit messages, and building the app.

Does this look solid to you? Are there any improvements or best practices I might be missing? Appreciate any tips or suggestions 🙏

    image: node:22

    options:
      size: 2x

    pipelines:
      pull-requests:
        "**":
          - step:
              name: Install Dependencies
              caches:
                - node
              script:
                - echo "Installing dependencies..."
                - npm ci
                - echo "Dependencies installed successfully!"
              artifacts:
                - node_modules/**
          - parallel:
              - step:
                  name: Code Quality Checks
                  script:
                    - echo "Running ESLint..."
                    - npm run eslint
                    - echo "Checking code formatting..."
                    - npm run format:check
              - step:
                  name: Validate Commit Messages
                  script:
                    - echo "Validating commit messages in PR..."
                    - npm run commitlint -- --from origin/$BITBUCKET_PR_DESTINATION_BRANCH --to HEAD --verbose
          - step:
              name: Build Application
              script:
                - echo "Building production application..."
                - npm run buildProd

https://redd.it/1mh8ub4
@r_devops
Tracing stack advice for large Java monolith

Hi all,

I have ~70 app servers running a big Java monolith. While it’s technically one app, each server has a different role (API, processing, integration, etc.).

I want to add a tracing stack and started exploring OpenTelemetry. The big blocker? It requires adding spans in the code. With millions of lines of legacy Java, that’s a nightmare.

I looked into zero-code instrumentation, but I’m not confident it’ll give me what I want—specifically being able to visualize different components (API vs. processing) cleanly in something like Grafana.

Has anyone faced something similar? How did you approach it? Any tools/strategies you’d recommend for tracing with minimal code changes?
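
For context, here is a minimal sketch of what the zero-code route might look like per server role, assuming the OpenTelemetry Java agent is attached via `JAVA_TOOL_OPTIONS` and each role gets its own `OTEL_SERVICE_NAME`. The jar path, collector endpoint, and the `app.role` attribute name are assumptions made up for illustration:

```python
from typing import Dict

AGENT_JAR = "/opt/otel/opentelemetry-javaagent.jar"  # assumed install path
COLLECTOR = "http://otel-collector:4317"             # assumed OTLP endpoint


def otel_env(role: str, app: str = "monolith") -> Dict[str, str]:
    """Environment variables for one app server with the given role."""
    return {
        # Attach the agent without touching application code.
        "JAVA_TOOL_OPTIONS": f"-javaagent:{AGENT_JAR}",
        # A distinct service name per role keeps API and processing apart.
        "OTEL_SERVICE_NAME": f"{app}-{role}",
        # Hypothetical custom attribute for filtering by role.
        "OTEL_RESOURCE_ATTRIBUTES": f"app.role={role}",
        "OTEL_EXPORTER_OTLP_ENDPOINT": COLLECTOR,
    }
```

With something like `otel_env("api")` on the API servers and `otel_env("processing")` on the workers, the same binary would show up as separate services in the tracing backend. Whether that is granular enough for the Grafana views I want is exactly what I'm unsure about.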

https://redd.it/1mhbgh4
@r_devops
Beta testers wanted: CLI tool to detect DB schema drift across Dev, Staging, Prod – Git-workflow, safe, reviewable. Currently MSSQL and MySQL

I’ve been working on a CLI tool called dbdrift – built to help track and review schema changes in databases across environments like Dev, Staging, Prod, and even external customer instances.

The goal is to bring Git-style workflows to SQL Server and MySQL schema management:

* Extracts all schema objects into plain text files: tables, views, routines, triggers
* Compares file vs. live DB and shows what changed, and which side is newer
* Works across multiple environments
* DBLint engine to flag risky or inconsistent patterns
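
To illustrate the file-vs-live comparison idea (this is not dbdrift's actual implementation, just a sketch using Python's standard `difflib`, with a hypothetical `schema_drift` helper and made-up `repo/` and `live/` labels):

```python
import difflib
from typing import List


def schema_drift(stored_ddl: str, live_ddl: str, name: str) -> List[str]:
    """Return unified-diff lines between the tracked file and the live object."""
    return list(
        difflib.unified_diff(
            stored_ddl.splitlines(),
            live_ddl.splitlines(),
            fromfile=f"repo/{name}.sql",
            tofile=f"live/{name}.sql",
            lineterm="",
        )
    )
```

An empty list means the tracked file and the live object match; anything else is reviewable like a normal Git diff.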

It’s standalone (no Docker, no cloud lock-in), runs as a single binary, and is easy to plug into existing CI/CD pipelines – or use locally (win/linux/macosx).

I’m currently looking for beta testers who deal with:

* Untracked schema changes
* Breaking changes to database structure
* Database reviews before deployment
* Linting of database SQL code

Drop a comment or DM if you’d like to test it – I’ll send over the current build and help get you started. Discord also available if preferred.

https://redd.it/1mhd4ba
@r_devops
I had no idea how to start learning AWS, here’s what actually helped me

When I first tried to learn AWS, I felt completely lost. There were all these services (EC2, S3, Lambda, IAM) and I had no clue where to begin or what actually mattered. I spent weeks jumping between random YouTube tutorials and blog posts, trying to piece everything together, but honestly none of it was sticking.

Someone suggested I look into the AWS Solutions Architect Associate cert. At first I thought, nah, I’m not ready for a cert, I just want to understand cloud basics. But I gave it a shot, and honestly it was the best decision I made. That cert path gave me structure. It basically forced me to learn the most important AWS services in a practical way: actually using them and understanding the core concepts, not just watching videos.

Even if you don’t take the exam, just following the study path teaches you EC2, S3, IAM, and VPC in a way that actually makes sense. And when I finally passed the exam, it gave me confidence that I wasn’t totally lost anymore, that I had learned something and could actually do something in the cloud.

If you’re sitting there wondering where to start with AWS, I’d say just follow the Solutions Architect roadmap. It’s way better than going in blind and getting overwhelmed like I did. Once you’ve got that down, you can explore whatever path you want (DevOps, AI tools, anything), and at least you’ll know how AWS works at the core.

Also, if anyone needs help with Solutions Architect prep, you can get in touch...

https://redd.it/1mhewp9
@r_devops
Mid-30s+ Engineers: How are you preparing for the AI revolution? Feeling behind and anxious

I'm an automation engineer with 10+ years of experience working at a media agency. My day-to-day involves building internal tools, process automation, and managing data pipelines, but nothing crazy. I'm not in any leadership position currently, and my domain is media/marketing tech rather than core tech product companies. My main skills are in Python and cloud.

With all the AI developments happening lately, I'm genuinely concerned about where my career is heading. The main issue I'm facing is a skills gap - my current role doesn't involve AI or machine learning at all. While I'm decent at what I do, I can't shake the feeling that I'm not building future-ready skills. Being in a media agency makes it even more challenging because there aren't many opportunities to transition into AI-focused roles internally.

I'm hoping to get advice from folks who might have gone through something similar. Has anyone here made a transition from traditional automation work to AI-related roles? How did you manage it? Should I focus on becoming really excellent at my current automation and DevOps skills, or should I try to pivot completely into AI? Are there specific areas where my automation experience might actually be valuable in AI workflows?

For anyone who switched career tracks in their 30s, I'd really appreciate some practical advice on managing this kind of transition. I'm not panicking about it, but I definitely want to make smart decisions for the next 5-10 years of my career. Thanks in advance for any insights you can share!

P.S. I used AI to help improve this post.

https://redd.it/1mhfstb
@r_devops
Best practices for migrating manually created monitors to Terraform?

Hi everyone,

We're currently looking to bring our manually created Datadog monitors under Terraform management to improve consistency and version control. I’m wondering what the best approach is to do this.

Specifically:

* Are there any tools or scripts you'd recommend for exporting existing monitors to Terraform HCL format?
* What manual steps should we be aware of during the migration?
* Have you encountered any gotchas or pitfalls when doing this (e.g., duplication, drift, downtime)?
* Once migrated, how do you enforce that future changes are made only via Terraform?
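
To make the question concrete, one approach we are considering is generating Terraform `import` blocks (Terraform 1.5+) from a list of monitor IDs. This is a sketch only: `resource_name` and `import_blocks` are hypothetical helpers, and the ID-to-name mapping is assumed to come from an export of our existing monitors:

```python
import re
from typing import Dict


def resource_name(monitor_name: str) -> str:
    """Turn a monitor title into a valid Terraform identifier."""
    slug = re.sub(r"[^a-z0-9]+", "_", monitor_name.lower()).strip("_")
    # Identifiers must not be empty or start with a digit.
    return slug if slug and not slug[0].isdigit() else f"monitor_{slug}"


def import_blocks(monitors: Dict[int, str]) -> str:
    """Render one Terraform `import` block per {monitor_id: name} entry."""
    blocks = []
    for monitor_id, name in sorted(monitors.items()):
        blocks.append(
            "import {\n"
            f"  to = datadog_monitor.{resource_name(name)}\n"
            f'  id = "{monitor_id}"\n'
            "}\n"
        )
    return "\n".join(blocks)
```

The rendered file could then be fed to `terraform plan -generate-config-out=generated.tf` to draft the HCL, which would still need manual review before applying.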

Any advice, examples, or lessons learned from your own migrations would be greatly appreciated!

Thanks in advance!

https://redd.it/1mhho6n
@r_devops
AWS RDS granular backup

Currently, our company manages all RDS backups using snapshots for PostgreSQL, MySQL, Oracle, and SQL Server. However, we've been asked to provide more granular backup capabilities — for example, the ability to restore a single table.

I'm considering setting up an EC2 instance to run scripts that generate dumps and store them in S3. Does this approach make sense, or would you recommend a better solution?
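
A rough sketch of what the dump script could build for the PostgreSQL case, assuming `pg_dump` is installed on the EC2 instance and credentials come from the environment. The bucket name, key layout, and `/tmp` path are placeholders, and the other engines would need their own tools:

```python
from datetime import datetime, timezone
from typing import List, Tuple


def table_dump_plan(
    host: str, database: str, user: str, table: str, bucket: str, now: datetime
) -> Tuple[List[str], str]:
    """Build a pg_dump command for one table and the S3 URI to store it at."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    local_file = f"/tmp/{database}.{table}.{stamp}.dump"
    command = [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--dbname", database,
        "--table", table,        # dump only this table
        "--format", "custom",    # archive format readable by pg_restore
        "--file", local_file,
    ]
    s3_uri = f"s3://{bucket}/{database}/{table}/{stamp}.dump"
    return command, s3_uri
```

The command list can be passed to `subprocess.run`, and the resulting file uploaded to the returned URI. Restoring a single table would then be a `pg_restore` of that one archive rather than a full snapshot restore.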

https://redd.it/1mhhbr5
@r_devops
Helm gets messy fast — how do you keep your charts maintainable at scale?

One day you're like “cool, I just need to override this value.”
Next thing, you're 12 layers deep into a chart you didn’t write… and staging is suddenly on fire.

I’ve seen teams try to standardize Helm across services — but it always turns into some kind of chart spaghetti over time.

Anyone out there found a sane way to work with Helm at scale in real teams?

https://redd.it/1mhben0
@r_devops
Careers UK?

Had a couple of job offers but nothing major in the past few months. 2 years of experience, reckoning I could achieve £60k.

LinkedIn and Indeed just aren’t cutting it anymore for me. I’ve also found applying direct to company gives me more success than recruiters reaching out about FinTech jobs all the time. What do people use in the UK for looking for jobs?

https://redd.it/1mhp1em
@r_devops
Generalize or Specialize?

A question keeps popping up for me:

"Should I generalize or specialize as a developer?"

I say "developer" to cover all kinds of tech-related domains (I guess DevOps counts too :D just kidding). What is your point of view on that? Do you stick more or less to your own domain, or do you branch out to every interesting GitHub repo you can find and jump right in?

https://redd.it/1mhsle9
@r_devops
Anyone found a stable way to run GPU inference on AWS without spot interruptions?

We’re running LLM inference on AWS with a small team and hitting issues with spot reclaim events. We’ve tried capacity-optimized ASGs, fallbacks, even checkpointing, but it still breaks when latency matters.

Reserved Instances aren’t flexible enough for us and pricing is tough on on-demand.

Just wondering — is there a way to stay on AWS but get some price relief and still keep workloads stable?

https://redd.it/1mhu165
@r_devops
Most common startup problem: wanting to rotate a secret but not knowing where it actually exists across the codebase

Does any paid or free tool in the appsec space solve this?

We recently integrated this feature into the DefendStack-Suite asset inventory while trying to solve this problem for one startup.
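
As a bare-bones illustration of the problem itself (not of how DefendStack-Suite does it), here is a sketch that walks a checkout and lists every line mentioning a secret's identifier. `find_references` is a hypothetical helper, and searching for the variable name rather than the secret value is an assumption:

```python
import os
from typing import List, Tuple


def find_references(root: str, needle: str) -> List[Tuple[str, int]]:
    """Return (relative path, line number) for every line containing needle."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that are never worth scanning.
        dirnames[:] = [d for d in dirnames if d not in {".git", "node_modules"}]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for lineno, line in enumerate(handle, start=1):
                        if needle in line:
                            hits.append((os.path.relpath(path, root), lineno))
            except OSError:
                continue  # unreadable file, skip it
    return sorted(hits)
```

Running this across every repository gives a first inventory of where a rotation has to land, though it misses secrets stored in CI settings or cloud config.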

https://redd.it/1mi072h
@r_devops
Indexing issue on my Laravel website

Hey everyone, I’ve recently launched a website built with Laravel, but I'm facing issues with getting it indexed by Google. When I search, none of the pages appear in the search results. I’ve submitted the site in Google Search Console and even tried the URL inspection tool, but it still won’t index. I’ve checked my robots.txt file and meta tags to make sure I’m not accidentally blocking crawlers, and I’ve also generated a proper sitemap using Spatie’s Laravel Sitemap package. The site returns a 200 status code and appears to be mobile-friendly. Still, nothing shows up in the index. Has anyone faced similar issues with Laravel SEO or indexing? Any advice or fixes would be appreciated!


https://redd.it/1mi1fzd
@r_devops
Manager gave bad reviews for getting too involved in code level details

So basically what the title says: my manager gave me a 3/5 rating on satisfaction, and his remark was that I get involved in code-level details, which is the developers' work. What even is DevOps then? Why the fuck wouldn't I check the code to get an overall understanding of the project? Later, if anything goes wrong in deployment, they'll blame the DevOps people. Idk man, my company has a totally different understanding of what DevOps means, and it hardly includes me in regular project meetings. To be clear, I don't mess with the code; I just ask questions about the app logic or whatever is necessary for the pipeline or cloud infra.

https://redd.it/1mi275k
@r_devops