Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
What are best practices when using templating tools (helm, kustomize, etc) and also a gitops model (like with ArgoCD)

Hey All,

I'm working on revamping our release process and I'm curious what everyone here thinks are the best practices when it comes to using templating tools like Kustomize and Helm while also following a GitOps workflow.

We use ArgoCD to manage our K8s deployments and currently pre-inflate our charts and process our kustomizations in CI, which then pushes the rendered manifests to git. The logic is that this keeps the source of truth truly immutable: we point at a specific git hash rather than trusting that Argo is pointing at the correct versions of things and reconciling on the fly.

This ultimately slows down our release process quite a bit.

I'm considering pitching that we utilize Argo's ability to inflate charts/process kustomizations so we don't need to pre-inflate/process them which would speed things up a lot. I'm just trying to see what the unintended side effects of that could be.
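
One side effect worth checking before letting Argo render on the fly is whether every Application still points at an immutable revision. The sketch below is only an illustration: it assumes single-source Application manifests already loaded as dicts and reads the `spec.source.targetRevision` field; the helper names `is_pinned` and `floating_apps` and the sample data are hypothetical.

```python
import re

# Full 40-character Git commit SHA.
_SHA = re.compile(r"^[0-9a-f]{40}$")
# Exact version such as 1.2.3 or v1.2.3 (no ranges or wildcards).
_EXACT_VERSION = re.compile(r"^v?\d+\.\d+\.\d+$")


def is_pinned(target_revision: str) -> bool:
    """Return True if the revision is a commit SHA or an exact version."""
    return bool(
        _SHA.match(target_revision) or _EXACT_VERSION.match(target_revision)
    )


def floating_apps(applications: list[dict]) -> list[str]:
    """Names of Applications whose source is not pinned to an exact revision."""
    floating = []
    for app in applications:
        # Assumption: a missing targetRevision is treated as the floating HEAD.
        revision = app["spec"]["source"].get("targetRevision", "HEAD")
        if not is_pinned(revision):
            floating.append(app["metadata"]["name"])
    return floating


# Hypothetical sample data for illustration only.
apps = [
    {"metadata": {"name": "billing"},
     "spec": {"source": {"targetRevision": "4.1.0"}}},
    {"metadata": {"name": "frontend"},
     "spec": {"source": {"targetRevision": "main"}}},
]
print(floating_apps(apps))
```

A check like this could run in CI in place of the pre-inflation step. Note that an exact version is only as immutable as the registry or tag behind it, so a commit SHA remains the stricter choice.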

Thanks!

https://redd.it/1ka2br5
@r_devops
How We Handle TBs of Trace Data: Apache Parquet + Smart Caching

In DevOps, dealing with large-scale distributed traces can be tricky. We’ve been using Apache Parquet to store trace data efficiently and improve the speed of our queries. By using columnar storage, we’ve drastically reduced I/O and made trace analysis much faster. Here’s how we combined this with caching and metadata management for optimal performance.

https://www.parseable.com/blog/opentelemetry-traces-to-parquet-the-good-and-the-good
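
To make the I/O claim concrete, here is a toy sketch of why a columnar layout reads less data for a single-field query. It uses plain Python lists rather than Parquet itself, and the trace field names are illustrative assumptions, not taken from the linked post.

```python
# Toy trace records; the field names are illustrative.
rows = [
    {"trace_id": "t1", "service": "api", "duration_ms": 120},
    {"trace_id": "t2", "service": "db", "duration_ms": 45},
    {"trace_id": "t3", "service": "api", "duration_ms": 300},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every field of every record is read to average one field.
row_values_read = sum(len(record) for record in rows)
# Column layout: only the duration_ms column is read.
col_values_read = len(columns["duration_ms"])

average_ms = sum(columns["duration_ms"]) / len(columns["duration_ms"])
print(row_values_read, col_values_read, average_ms)
```

Real Parquet files add compression, encoding and per-column statistics on top of this layout, which this sketch does not model.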

https://redd.it/1ka0v8f
@r_devops
Video Terraform 101 for DevOps Engineers

Hey folks, 👋

I started a YouTube channel focused on DevOps topics, where I present different concepts in a pragmatic way. My latest video is called "Terraform 101 for DevOps Engineers | Beginner’s Guide to Infrastructure as Code".
It's designed to give beginners (or anyone needing a refresher) a solid foundation on how Terraform fits into DevOps workflows.

I cover:
What Terraform actually is and why it's important
Core concepts like Providers, Resources, and State Management
How Terraform integrates into CI/CD pipelines - but plan to expand on this later
Common mistakes to avoid when using it in production

The goal was to keep it fast, practical, and beginner-friendly — no 2-hour theory lectures. 😅
If you're starting to automate your infrastructure or prepping for DevOps interviews, I think it'll help.

Here’s the link if you want to check it out:
👉 https://youtu.be/z3CLMsYtxYw

Feedback is more than welcome! Also, I am open to video ideas too! I have a solid backlog of planned videos, but I'm happy to cover something important to the community.
And of course, I would really appreciate subscriptions :)





https://redd.it/1ka43hp
@r_devops
Opinions on my personal project.

Hello r/devops!

I just worked on a personal project that I would appreciate your opinion on. It's an AWS Infrastructure automation pipeline using Jenkins, Terraform and Ansible.

* Terraform - Starts the EC2 instance using a launch template and auto-scaling group with all necessary attributes attached (security groups, key pair, etc.).
* Ansible - Logs into the EC2 instance, downloads services and copies the necessary HTML and CSS files from my portfolio website into /var/www/html, making it visible from the browser.
* Jenkins - Has two pipelines.
  * 'Create' pipeline
    * Runs the Terraform part to start the EC2 instance, retrieves the IP of the new instance using the aws-describe command, and adds it to the hosts file for Ansible to use. Then runs the Ansible part to get the website live.
    * Triggered by a git push
  * 'Destroy' pipeline
    * Runs terraform destroy to take down the infrastructure safely.
    * This is invoked by the 'create' pipeline and runs 15 minutes after it.
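
The "retrieve the IP and add it to the hosts file" step in the 'Create' pipeline could look roughly like the sketch below. It assumes the JSON printed by `aws ec2 describe-instances` has been captured as a string; the function names, the `web` group name and the sample IP are hypothetical, and this is not the poster's actual script.

```python
import json


def running_public_ips(describe_output: str) -> list[str]:
    """Collect public IPs of running instances from describe-instances JSON."""
    data = json.loads(describe_output)
    ips = []
    for reservation in data.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            state = instance.get("State", {}).get("Name")
            ip = instance.get("PublicIpAddress")
            # Skip instances that are still pending or have no public IP yet.
            if state == "running" and ip:
                ips.append(ip)
    return ips


def render_inventory(ips: list[str], group: str = "web") -> str:
    """Render an INI-style Ansible inventory with one host group."""
    return "\n".join([f"[{group}]", *ips]) + "\n"


# Hypothetical captured CLI output, trimmed to the fields used above.
sample = json.dumps({
    "Reservations": [{
        "Instances": [
            {"State": {"Name": "running"}, "PublicIpAddress": "203.0.113.10"},
            {"State": {"Name": "pending"}},
        ]
    }]
})
print(render_inventory(running_public_ips(sample)), end="")
```

Filtering on the `running` state matters with an auto-scaling group, since a freshly launched instance may appear in the output before it has a public IP.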

I did learn a lot about all these tools, credential security and management, automation, etc. Before y'all come at me, I know that some of my choices might seem weird, like using Jenkins instead of GitHub Actions, or using Ansible when the entire thing can be taken care of by a user_data script, or hosting it on AWS when I can just have it on my .github.io page.
I used the tools and technologies because I wanted to learn these tools specifically, as they seem to be more prevalent in job descriptions. Outside of these things, do you have any thoughts about whether it's *actually* a good project to have on my resume, whether it could impress potential hiring managers/recruiters, etc? Should I change something, use different tools, or anything else at all? I'm open to honest feedback and would love to improve. I love automation and I love building things, so I can do this all over again without an issue.

P.S - I'm a grad student with 2 years of experience as a System Engineer, just to give you an idea of my background.

https://redd.it/1ka9qxn
@r_devops
30 days into Network operations role -- Did I step into unsustainable chaos?

I started a new position 30 days ago at an MSP (Managed Service Provider) as a Network Operations Manager.

My original understanding was that I'd lead infrastructure migration projects at a structured, strategic pace — taking ownership of planning, execution, and building operational discipline.

I knew the environment might be somewhat messy — and I actually saw that as an opportunity to bring structure where it was needed.

But instead, an existing senior team member (let's call him Mark) immediately flooded the process with urgency:

– Meetings all day, often back-to-back

– Little to no time to plan deeply, reflect, or organize properly

– Constant interruptions and ad hoc requests — expectation to be hyper-responsive

– No official timeline from leadership, but Mark imposed a fast-track timeline anyway

Meanwhile, the CTO — who I technically report to — is largely absent:

– Doesn’t respond to emails

– Doesn’t return calls

– Occasionally appears briefly (e.g., grabbing a sandwich at the airport) but otherwise offers no active guidance

I also hired two team members early on, originally planning to assign them to focused infrastructure projects.

But with the current chaos, they are now being treated as generalists, expected to somehow cover a wide range of topics, including undocumented environments.

Additionally, while I was never explicitly told it was a "cloud-first MSP," the way the role was presented (focused on infrastructure modernization and migration leadership) led me to assume it was heavily cloud-oriented.

In reality:

– Only about 20% of the infrastructure is actually cloud-based.

– Roughly 40% is legacy systems, many undocumented, requiring reverse engineering just to understand what's running.

(For context, during the interview I asked for a website to learn more about the company, and was told they didn’t have one — in hindsight, that probably should have been a red flag.)

The biggest problem:

I was hired to bring structure, but the current rhythm is so accelerated that trying to implement thoughtful leadership would simply slow things down.

In short:

– I feel I’ve lost the leadership narrative I was hired for.

– I’m being forced to play at their chaotic rhythm instead of leading with my own structure and pace.

Mark himself is extremely intense:

– Wakes up at 3–5 AM

– Eats lunch by 9 AM

– Spends afternoons studying for certifications — while pushing the team at full speed

I was aiming for a leadership role where I could build, structure, and scale — not a permanent crisis-response role in a fragmented environment.

Am I overreacting?

Is this just what IT leadership looks like today?

You're welcome to criticize me.

I’d appreciate any references:

– Is this 50%, 70%, 90% of IT leadership roles now?

– Is this common across MSPs?

– Or are there still companies where structured leadership and thoughtful execution are respected?

– Does it make sense to stay two more weeks, or do you see a long-term position worth enduring?

Thanks for reading — I’m trying to calibrate my expectations.

https://redd.it/1kaala1
@r_devops
Getting into Devops

I am thinking about taking the SANS GCSA course (sponsored by my job). I have about 2 years of experience in IT and I am trying to get into DevOps. I was wondering whether we are allowed to put the course projects on our resume, and whether we can do them on our personal GitHub. Also, would it be comprehensive enough to help me break into DevSecOps? And what should I understand before starting the class to increase my chances of grasping and internalizing the concepts?

https://redd.it/1kafyqk
@r_devops
Devops or AI? For Freshers

Hi everyone, I am in my second year of college (B.Tech CSE) and confused between two paths: DevOps or AI.
Could anyone please guide me on which field to choose, considering internship and job availability for freshers and college students, so that my career is secured (not forever, but at least so I can step into the industry)?
How much time will it take to learn? Any project ideas for resumes (because I think unique projects are almost not possible now)?

PS: I understand the usual advice: follow your passion, see whether you like solving maths or problems. I just want to secure my career in IT. I have no problem doing maths or learning tools.

https://redd.it/1kah8zd
@r_devops
Need Advice

Hello Folks,

Need your advice here.

I am 24M and working as a service desk agent in an MNC. I have 2.6 years of experience that is not relevant to DevOps, and I want to enter this field.

I will complete 3 years in my organisation very soon.

I have knowledge of AWS, Git, Docker, Jenkins, ECS, EKS, ECR and Terraform, plus some monitoring tools such as New Relic and Splunk.

Am I too late to make a switch to DevOps?

Is this skillset enough?



https://redd.it/1kahmxi
@r_devops
How to keep up with industry news?

I need help keeping up with industry trends and standards.
Suggestions are welcome if there are any newsletters or Twitter folks that you follow to get this info.
I'm asking because lately it feels like I'm doing nothing to understand what is happening at other companies or how they are using technology differently.

https://redd.it/1kaiu5c
@r_devops
DevOps friends: Would you use GitHub Pull Requests to self-serve cloud access (Terraform-based)?

Hey everyone,
I’m trying to validate an idea and would love your feedback:



Problem:
In most companies, developers need to constantly ask cloud admins for access to different environments (dev, staging, prod) or specific cloud services.
This slows things down, creates bottlenecks, and makes teams less autonomous.



Idea:
Instead of waiting for admins, developers could:
• Open a GitHub Pull Request
• Fill out a simple YAML (what access they need, what environment, what role)
• PR gets reviewed and approved by a team lead
• GitHub Action runs Terraform automatically to grant access
• (Optional) Access could auto-expire after a few hours/days.

Basically:
Access as Code, Self-service, GitOps-native.
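
As a rough sketch of what the automated part of such a PR check might do, the snippet below validates a parsed request and computes its expiry. Everything here is an assumption for illustration: the field names, the allow-lists, the 7-day cap and the `AccessRequest` type are hypothetical, and the YAML parsing and Terraform steps are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical allow-lists; a real setup would define its own.
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}
ALLOWED_ROLES = {"read-only", "developer", "admin"}
MAX_DURATION_HOURS = 7 * 24


@dataclass(frozen=True)
class AccessRequest:
    user: str
    environment: str
    role: str
    duration_hours: int


def validate(request: AccessRequest) -> list[str]:
    """Return a list of problems; an empty list means the request is acceptable."""
    problems = []
    if request.environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {request.environment}")
    if request.role not in ALLOWED_ROLES:
        problems.append(f"unknown role: {request.role}")
    if not 0 < request.duration_hours <= MAX_DURATION_HOURS:
        problems.append("duration must be between 1 hour and 7 days")
    return problems


def expires_at(request: AccessRequest, approved_at: datetime) -> datetime:
    """Compute when the grant should be revoked."""
    return approved_at + timedelta(hours=request.duration_hours)


request = AccessRequest("alice", "staging", "developer", 4)
approved = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(validate(request), expires_at(request, approved).isoformat())
```

Running validation before the reviewer sees the PR keeps obviously malformed requests out of the approval queue, and storing the computed expiry alongside the grant gives a later cleanup job something to act on.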



Why I think it’s better:
• Developers already live in GitHub
• Access requests go through normal code review processes
• Everything is auditable
• No more “please grant me access” tickets
• Works across AWS / Azure / GCP



Question to you all:
• Would you or your team actually use something like this?
• What would stop you from adopting it?
• Anything missing you’d expect?



I’m considering building both:
• A self-hosted open source version (basic features)
• A SaaS version (more enterprise features: expiration, Slack integration, etc.)

Appreciate any brutally honest thoughts — even if you think it’s a bad idea! Thanks!

https://redd.it/1kajdbt
@r_devops
New to Kubernetes? Here’s When You Actually Need It (And When You Don’t)

Hi Folks, Managing 100+ containers across servers? Don’t do it manually, let Kubernetes automate the chaos for you! If you’re just starting out with Docker and Kubernetes, this post will help you understand when Kubernetes is truly needed and when simpler tools like Docker Compose are enough. This is part of the 60-day ReadList series #5, Simplifying Docker & Kubernetes, one post at a time!

TL;DR
1. When to use Docker Compose? Small projects (1–10 containers), single server.
2. When to use Kubernetes? Large apps with many containers, need auto-scaling, fault tolerance, and high availability.

Even for Computer Vision models like car damage detection, we used Docker Compose and it worked great! You don’t always need Kubernetes from day one.

Kubernetes addresses the challenges of managing containerized applications at scale. If you're a beginner, don't feel pressured to jump into Kubernetes too early. For small apps, Docker Compose can handle things perfectly. But as your app grows (more traffic, more servers, more complexity), Kubernetes becomes a must-have for reliability, scaling, and automation.

Check it out here, folks: From Simple to Scalable: When to Choose Kubernetes Over Docker Compose

Stay tuned for more beginner-friendly posts as I dive deeper into Kubernetes concepts and hands-on commands!

https://redd.it/1kag90j
@r_devops
Disappointed by myself

Hey guys, I just want to open up a bit, since in IT you don't often get the chance.

I have been working as a DevOps Engineer for the past four years. My organization has never given me a chance to work on actual DevOps tools (they handed me Azure DevOps classic pipelines and some change processes in ServiceNow), shifting me between internal teams and keeping me busy with this. I have never gotten a chance to explore and upskill myself with the latest tools.

Today, an internal call was set up for my technical interview, and I completely choked. It was really awkward not being able to answer any questions.

I feel disappointed in myself. I want to learn and excel at my job but am not getting proper support. I can't switch jobs due to market volatility and this 90-day notice period. There isn't a single, worthwhile roadmap that covers everything step-by-step and is easy to learn.

I can only cry now; I can't do much for myself.

https://redd.it/1kal4gy
@r_devops
Query OpenSearch logs and export them to CSV or JSON.

Hey there, someone asked me to do this task at work, and I decided to share the script in case anyone finds it helpful, because I couldn't find any similar, simple scripts.

https://github.com/polymons/opensearch-export
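
For readers who want the general shape of such an export, here is a minimal sketch of the CSV step. It is not the linked script: it only assumes a search response already parsed into a dict with the usual `hits.hits[]._source` layout, and the function name, field names and sample document are hypothetical.

```python
import csv
import io


def hits_to_csv(response: dict, fields: list[str]) -> str:
    """Flatten the _source of each hit into CSV with the given columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        # Missing fields become empty cells rather than raising.
        writer.writerow({field: source.get(field, "") for field in fields})
    return buffer.getvalue()


# Hypothetical response trimmed to the parts used above.
response = {
    "hits": {
        "hits": [
            {"_source": {"timestamp": "2025-01-01T00:00:00Z",
                         "level": "error",
                         "message": "disk full"}},
        ]
    }
}
print(hits_to_csv(response, ["timestamp", "level"]), end="")
```

Passing an explicit column list keeps the output stable even when documents in the index carry different fields.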




https://redd.it/1kajlem
@r_devops
yaml vs alternatives as a configuration language

There are a number of relatively recent configuration languages positioned as replacements for yaml:

- jsonnet (https://github.com/google/jsonnet)
- pkl (https://github.com/apple/pkl)
- cue (https://github.com/cue-lang/cue)
- hcl (https://github.com/hashicorp/hcl)

Do you use any of them? What was your experience? Did I miss any other languages? Do you think any one of them is replacing yaml/helm for kubernetes configuration?

https://redd.it/1kaomen
@r_devops
How to debug Kafka consumer applications running in a Kubernetes environment

Hey all, sharing a guide we wrote on debugging Kafka consumers without the overhead of rebuilding and redeploying your application.

I hope you find it useful, and would love to hear any feedback you might have.

🔗 Link

https://redd.it/1kapojt
@r_devops
Filtering health checks from observability data feels wrong… is it actually right?

Recently, I was trying out different optimisations to reduce observability noise from my app in my OpenTelemetry collector.

Ofc, one of the first methods that came up was filtering, and almost everywhere the examples given were on filtering health checks and synthetic monitoring calls.

When I read this I was confused. The point of health check calls (afaik) is to check the liveness of a service and if it's up, right? Isn't that a crucial metric to observe? Why would I filter that and discard it as noise?

Went down the rabbit hole a bit and realised the answer is more about **noise vs signal**:

* Health checks (like `/health`) usually get called every few seconds per pod, across dozens/hundreds of services.
* If you're capturing traces, logs, or metrics for every one of those probes, you're just generating **tons of repetitive, low-value telemetry** that becomes noisy and heavy on your pocket, without adding any meaning.
* **Most modern observability setups (especially Kubernetes environments) already track pod liveness probes separately, i.e., you get infra metrics like "pod up/down" and "readiness failures" without needing to generate extra spans or logs every time a health check hits.**

The last reason is why we usually filter out health check calls at the APM level and leave them to the infra level. It also explains why filtering health checks is always described as just cutting down the noise.
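
The filtering rule itself is simple; a minimal sketch of the predicate follows. It assumes spans represented as dicts carrying the `http.route` attribute, and the set of probe paths is a hypothetical example. In an actual OpenTelemetry Collector this logic would live in the pipeline configuration rather than in Python.

```python
# Hypothetical probe paths; adjust to whatever the services actually expose.
HEALTH_PATHS = {"/health", "/healthz", "/ready"}


def is_health_check(span: dict) -> bool:
    """True if the span's route attribute matches a known probe path."""
    route = span.get("attributes", {}).get("http.route", "")
    return route in HEALTH_PATHS


def drop_health_checks(spans: list[dict]) -> list[dict]:
    """Keep only spans that are not health-check probes."""
    return [span for span in spans if not is_health_check(span)]


# Hypothetical spans for illustration.
spans = [
    {"name": "GET /health", "attributes": {"http.route": "/health"}},
    {"name": "GET /orders", "attributes": {"http.route": "/orders"}},
]
print([span["name"] for span in drop_health_checks(spans)])
```

Matching on an exact route rather than a substring avoids dropping business endpoints that merely contain the word "health".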

I'm writing a blog on cutting observability costs (putting my observations into perspective) and would love to know if you also aggressively filter these calls or if you just are meh about it.





https://redd.it/1karboi
@r_devops
New to DevOps – Need Guidance from Senior Engineers (Have Free Access to Coursera)

Hey folks,

I'm just starting my DevOps journey and could really use some advice from those of you who are further down the path—especially senior DevOps engineers.

I recently got access to a Coursera license through my school, and I want to make the most of it while I can. There's a ton of content out there (certs, courses, tools, cloud providers, etc.), and honestly, it's a bit overwhelming.

What would you recommend I focus on first? I see things like Docker, Kubernetes, Jenkins, Terraform, AWS, GCP, CI/CD, etc., thrown around a lot. But I want to build a solid foundation without spreading myself too thin or wasting time on stuff that's not as relevant early on.

If you were starting over today, knowing what you know now, what would your roadmap look like?
Also, any Coursera-specific courses or certs you'd strongly recommend?

Really appreciate any input. Thanks in advance!

https://redd.it/1kaquvi
@r_devops
OneUptime: Open-Source Incident.io Alternative

OneUptime (https://github.com/oneuptime/oneuptime) is the open-source alternative to Incident.io + StatusPage.io + UptimeRobot + Loggly + PagerDuty. It's 100% free and you can self-host it on your VM / server. OneUptime has Uptime Monitoring, Logs Management, Status Pages, Tracing, On Call Software, Incident Management and more, all under one platform.

Updates:

Native integration with Slack: Now you can integrate OneUptime with Slack natively (even if you're self-hosted!). OneUptime can create new channels when incidents happen, notify Slack users who are on-call, and even write up a draft postmortem for you based on the Slack channel conversation, and more!

Dashboards (just like Datadog): Collect any metrics you like, build dashboards, and share them with your team!

Roadmap:

Microsoft Teams integration, Terraform / infra as code support, fixing your ops issues automatically in code with the LLM of your choice, and more.

OPEN SOURCE COMMITMENT: Unlike other companies, we will always be FOSS under Apache License. We're 100% open-source and no part of OneUptime is behind the walled garden.

https://redd.it/1kaubww
@r_devops
Nix and NixOS

I was getting overwhelmed by using dotfiles to provision my own local dev machines, so I tried out Nix (running on Ubuntu). I really like the way they do things, but it's a bit of a learning curve. Maybe I'm going to try switching to NixOS for a while.

But thinking in terms of the future, it doesn't seem as universally adopted as Docker and Wasm. Is it really useful to learn NixOS? Or is it better to just use Docker?

https://redd.it/1kawieb
@r_devops
Kubernetes Cluster usage correct or not?

I'm a devsecops intern, and in our company we are given access to the k8s cluster like this:

After connecting to the company's VPN, the other devsecops interns and I need to ssh to one of the 3 master nodes in the cluster as the user 'intern', and then we can run kubectl commands from there.


I want to ask if that's the best way to work on the cluster. Isn't it expected that I can talk to the cluster from my own machine without having to ssh to a master node?

https://redd.it/1kav8tq
@r_devops