Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Best Docker registry with image housekeeping support

Hi all,

We’re looking to set up a private Docker registry for our company and one of our must-have features is automatic housekeeping — we need to delete old or unused images to manage disk usage effectively.

We use Jenkins for CI/CD, which pushes images frequently, so over time our registry gets cluttered with outdated builds and untagged layers. We'd like a solution that can:

Run scheduled or on-demand cleanup jobs

Support retention policies (e.g., keep last N images or delete images older than X days)

Ideally offer a web UI and/or API for managing images

Integrate well with Jenkins or at least not get in the way
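
To make the retention rules above concrete, here is a minimal Python sketch of how "keep last N" and "older than X days" could combine. It is not tied to Harbor, Nexus, or any registry API; the `Image` type, the `select_for_deletion` name, and the choice to delete only images that fail both rules are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Image:
    # Hypothetical record: a tag plus the time it was pushed.
    tag: str
    pushed_at: datetime


def select_for_deletion(images, keep_last, max_age_days, now):
    """Return images that are outside the newest `keep_last` AND older than the cutoff."""
    newest_first = sorted(images, key=lambda i: i.pushed_at, reverse=True)
    protected = {i.tag for i in newest_first[:keep_last]}
    cutoff = now - timedelta(days=max_age_days)
    return [i for i in newest_first
            if i.tag not in protected and i.pushed_at < cutoff]


now = datetime(2025, 5, 25, tzinfo=timezone.utc)
# Six made-up builds pushed 0, 10, 20, 30, 40 and 50 days ago.
images = [Image(f"build-{n}", now - timedelta(days=n * 10)) for n in range(6)]
doomed = select_for_deletion(images, keep_last=3, max_age_days=30, now=now)
print([i.tag for i in doomed])  # ['build-4', 'build-5']
```

A real cleanup job would feed this from the registry's own image listing and would usually start in dry-run mode, logging what it would delete before removing anything.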


We’re currently evaluating Harbor and Nexus, but open to other suggestions too. What are you using in production for this kind of setup? Any pros/cons we should know about?

Thanks!

https://redd.it/1kv5o1v
@r_devops
Transitioning to a DevOps career and the importance of certifications

I have experience in support and some infrastructure (networks and basic Linux). What would be an ideal schedule to follow to make the most of my career transition?



Another question: are certifications like LPI an important requirement when applying for these positions?

https://redd.it/1kv7rku
@r_devops
DevOps Buddy wanted! LeetCode, tech chats, open source & more!

Hey Reddit!

Looking for someone to team up with for DevOps stuff. I wanna get better at LeetCode, chat about cool tech, mess around with open-source projects, and just keep each other motivated.

I'm really into DevOps and trying to learn more about [mention something specific you're into, like Kubernetes or AWS]. LeetCode's on my list to boost my problem-solving.

If you're up for:
* LeetCode sessions: Let's tackle problems and share ideas.
* DevOps talks: Bouncing ideas around, discussing tools, or just complaining about YAML. 😉
* General tech chats: What's new? What's cool?
* Open source fun: Exploring or even contributing.
* Being accountability buddies: Keeping each other on track.

You don't have to be a guru, just enthusiastic about learning. We can link up online (Discord/Telegram, etc.) whenever works.

If this sounds like your jam, hit me up with a comment or a DM! Let's learn together.


https://redd.it/1kv8ryp
@r_devops
How I Automated My Infrastructure with Terraform

Hello everyone!
I wanted to share one of my more... questionable engineering decisions: I Terraformed my entire home network.

I've been managing my Mikrotik setup (router + switches + wireless) with Terraform for about a year now. Everything from VLANs to firewall rules is defined as code and version controlled.

All of the code is available here: https://github.com/mirceanton/mikrotik-terraform/

Why Terraform for networking?
Honestly, because it's the tool I know. When I found out the RouterOS provider existed, I just had to try it. Probably not the most practical approach, but it's been a great learning experience!

The state management situation is... creative. Can't exactly use S3 when you might accidentally terraform your own internet connection away! I ended up going with local state + SOPS encryption + Git. Works, I guess, but it's definitely not textbook.

Oh, and the amount of terraform state mv commands I've run during refactoring... SO many. I can't just destroy and recreate resources because they are, quite literally, my internet connection. I don't think I've ever had to do this much state surgery... even at work.
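
For anyone who hasn't done this kind of state surgery, here is a small Python sketch of batching the moves during a refactor. `terraform state mv SOURCE DESTINATION` is the real CLI subcommand; the resource addresses and the `state_mv_commands` helper are hypothetical and not taken from the linked repository.

```python
import shlex


def state_mv_commands(renames):
    """Build one `terraform state mv` argv per (old, new) resource address."""
    return [["terraform", "state", "mv", old, new]
            for old, new in renames.items() if old != new]


# Hypothetical resource addresses, for illustration only.
renames = {
    "routeros_ip_address.lan": "module.router.routeros_ip_address.lan",
    "routeros_interface_vlan.iot": "module.router.routeros_interface_vlan.iot",
}
commands = state_mv_commands(renames)
for argv in commands:
    # Print first so the moves can be reviewed before anything touches the state.
    print(shlex.join(argv))
```

Once the printed commands look right, each `argv` could be passed to `subprocess.run(argv, check=True)`, followed by `terraform plan` to confirm that nothing is scheduled to be destroyed and recreated.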

The whole thing taught me a lot about both Terraform and networking. Sometimes picking an overly complicated approach is the best way to learn!

Made a video about it too, if you're interested, where I go into my setup as well, not just the code: https://youtu.be/86LRoxuU5kg

Anyone else using Terraform in non-conventional ways? Would love to hear about other creative use cases or approaches!

https://redd.it/1kv99c6
@r_devops
Learn by doing

I'm looking to team up with some like-minded individuals who have a basic grasp of various tools and are ready to jump into some exciting projects! I've got a few cool ideas we could start working on together.

If you're interested in collaborating and bringing some of these ideas to life, let's create a Discord server and get started

https://redd.it/1kvdbhj
@r_devops
Hiring Managers

1) What are some of the skills with the most demand right now and will stay in demand for the next 30 or so years?

2) How is the job market right now for Cloud/DevOps and SRE roles?

https://redd.it/1kvesqr
@r_devops
Bare metal K8s Cluster Inherited


We inherited an infrastructure consisting of 5 physical servers that make up a k8s cluster: one master and four worker nodes. Workloads were also allowed to run on the master itself.

It is an ancient installation and the physical servers have either RAID-0 or single disk. They used OpenEBS Hostpath for persistent volumes for all the products.

Now, this is a development cluster but it contains important data. We have several small issues to fix, like:

- Migrate the PV to a distributed storage like NFS

- Make backups of relevant data

- Reinstall the servers and have proper RAID-1 ( at least )

We do not have many resources. We do not have ( for now ) a spare server.

We do have an NFS server. We can use that.

What are good options to implement to mitigate the problems we have? Our goal is to reinstall the servers using proper RAID-1 and migrate some PV to NFS so the data is not lost if we lose one node.

I listed some action points:

- Use the NFS server and perform backups using Velero

- Migrate the PVs to the NFS storage


At least we would have backups and some safety.

But how could we start with the servers that do not have RAID-1? The very master itself is single disk. How could we reinstall it and bring it back to the cluster?

Ideally, we would reinstall server by server until all of them have RAID-1 ( or RAID-6 ). But how could we start? We have only one master, and the PVs are attached to the nodes themselves.

It would be nice to convert this setup to Proxmox or some other virtualization system, but I think that is a second step.

Thanks!

https://redd.it/1kvdnb3
@r_devops
Scaling Postgres with Kubernetes: a guide on partitioning, sharding and replication

I have written a guide on setting up a high-availability Postgres cluster with sharding, replication and partitioning. Hope you find this helpful. 🐘



https://blog.sagyamthapa.com.np/scaling-postgresql-with-kubernetes

https://redd.it/1kvdc66
@r_devops
Developer to Devops resume review

I'm a backend developer with over 2.5 years of experience, and I’m looking to transition into a DevOps role. In my resume, the Developer and DevOps roles are listed under the same company. I’ve been involved in DevOps tasks for the past year, but there wasn’t much to learn beyond the tools I’ve already mentioned. That’s why I worked on personal projects to gain a deeper understanding.

Most of the DevOps skills I’ve acquired have been through these personal projects.

I’ve currently separated the Developer and DevOps roles into two parts on my resume, as I wasn’t sure how to present the experience correctly.

I would appreciate your guidance while keeping these points in mind. I’m open to omitting anything unnecessary and willing to add whatever is needed.

My resume below..
kindly review
https://i.postimg.cc/4x1BFCXw/IMG-20250523-225607.jpg

https://redd.it/1kviy4n
@r_devops
cheaper datadog alternative for APM?

Our Datadog bill is starting to get eye-watering for web APM purposes. We use Datadog for web APM because we need insight into site code for a couple of Python and Node.js services, and well... they were the safe choice. But our data volume has gone up quite a bit over the past 4 months, so I'm now tasked with evaluating other options.

We already use Elastic for an internal service and we're happy with that, so that could be an option for logging. I'm open to ideas: Honeycomb, Sentry, Sumo Logic, Splunk, New Relic, Dynatrace, Grafana, Groundcover, whatever works. Cloud metrics are cool but that's not what we use DD for, so if it can't do traces it's automatically a non-starter. Preferably no deep dev integration (no code changes would be great); we just don't have the resources and have other fires to fight. Also open to database APM features that work well with PostgreSQL workloads and can tie web APM traces to DB traces.

Advice / input appreciated.

https://redd.it/1kvlssd
@r_devops
How I Blocked 95% of Web Attacks Using AWS WAF Blog


I recently wrote a blog post about securing web apps using AWS WAF, and how you can block up to 95% of common attacks (like SQL injection, XSS, bot traffic, and even basic DDoS) with just a few clicks in the AWS Console.

If you’re on AWS and haven’t tried WAF yet (or find it intimidating), this guide breaks it down step by step:

https://blog.prateekjain.dev/how-to-block-up-to-95-of-attacks-using-aws-waf-e2223efc1f55?sk=cc74156befaab48297655a00f352f4e6

https://redd.it/1kvm4gp
@r_devops
Best books/courses to transition from Developer to DevOps

Hello everyone,
I am a full-stack developer with 4 years of experience. I use Angular/TypeScript for the frontend and Spring Boot/Java for the backend.

I also have basic knowledge of Docker and Jenkins (using pipelines and writing basic templates), a Kubernetes developer certification, some cloud knowledge (basic AWS services and Azure Fundamentals), and some Linux basics.

I would like to transition from developer to DevOps, but I am a bit lost as to which path to follow. So I would like recommendations for a couple of books or courses to help me make the transition.



PS: I know it depends, and maybe a bit subjective but any guide would help me understand.

Thank you!


https://redd.it/1kvoyoz
@r_devops
Build an incident response workflow with n8n + Prometheus

Hey guys,

I’m working on a monitoring setup that automates basic incident resolutions.

This is the visualization of the flow:

https://drive.google.com/file/d/1HiobPj50VZp1VylyqLTXLAeqDoJtrG_x/view

I’m using Prometheus - Grafana for monitoring, Alertmanager to send alerts, and n8n to orchestrate a workflow, then an AWS Lambda function to restart the services. “Restart services” is a kind of demo action, you can customize it for your needs.

How does it work?

Prometheus: I configure some basic rules to alert when CPU/memory exceeds a threshold. When a threshold is exceeded, Alertmanager sends a webhook to the n8n system.
n8n flow: Gets the information, analyzes the metrics, calculates the business hours or incident duration, and sends alerts to Discord or escalates to PagerDuty.
AI agent (in n8n): I define a prompt to check the input. It considers the metrics and the current context to decide whether to restart the services or not.
Lambda function: Receives the commands from the AI agent and processes them if necessary. Currently, I allow it to restart an EC2 instance to make the service available again when the system is overloaded.
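
As a rough idea of the "business hours or incident duration" step, here is a minimal Python sketch of the routing decision. This is not the n8n workflow from the repositories below; the 9:00-18:00 weekday window, the 15-minute escalation threshold, and the function names are all assumptions.

```python
from datetime import datetime, timedelta


def in_business_hours(ts, start_hour=9, end_hour=18):
    """True on Monday-Friday between start_hour (inclusive) and end_hour (exclusive)."""
    return ts.weekday() < 5 and start_hour <= ts.hour < end_hour


def route_alert(started_at, now, escalate_after=timedelta(minutes=15)):
    """Pick a notification target from incident duration and time of day."""
    duration = now - started_at
    # Long-running or out-of-hours incidents go to the on-call pager (assumed policy).
    if duration >= escalate_after or not in_business_hours(now):
        return "pagerduty"
    return "discord"


monday_10am = datetime(2025, 5, 26, 10, 0)
print(route_alert(monday_10am, monday_10am + timedelta(minutes=5)))   # discord
print(route_alert(monday_10am, monday_10am + timedelta(minutes=20)))  # pagerduty
```

Keeping this decision as plain deterministic code, with the AI agent handling only the restart-or-not call, makes the escalation path easier to test and audit.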

I hope this helps you to apply an automated stack in your team. I’ve shared the example materials in those repositories:

One-click setup for Prometheus - Alertmanager - Grafana: https://github.com/Bubobot-Team/monitoring-stack/tree/main/stacks/prometheus-stack

N8n workflow in JSON format (just copy into your n8n dashboard): https://github.com/Bubobot-Team/automation-workflow-monitoring

Btw, just wondering, what recovery actions would you automate? (e.g., disk cleanup, rollback deployments). I would like to hear your feedback to improve the current flow.

https://redd.it/1kvqdph
@r_devops
Is a container an instance of an image, like an object is an instance of a class in programming?

class Dog {
    String name;
    int age;

    Dog(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Creating multiple instances with different values
Dog dog1 = new Dog("James", 3);
Dog dog2 = new Dog("Bella", 5);

Docker

docker run -d --name app1 -e NAME=James -e AGE=3 mydogimage
docker run -d --name app2 -e NAME=Bella -e AGE=5 mydogimage



Is this true, or am I misunderstanding?
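
The analogy holds as a first approximation: Docker's own documentation describes a container as a runnable instance of an image. A toy Python model of the same idea follows. It is not how Docker is implemented; the `Image` and `Container` classes and their fields are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Image:
    """Read-only template, shared by every container started from it."""
    name: str
    files: tuple


@dataclass
class Container:
    """Running instance: its own name, environment, and writable changes."""
    image: Image
    name: str
    env: dict
    written: list = field(default_factory=list)


image = Image("mydogimage", ("/app/dog.py",))
app1 = Container(image, "app1", {"NAME": "James", "AGE": "3"})
app2 = Container(image, "app2", {"NAME": "Bella", "AGE": "5"})

app1.written.append("/tmp/cache")  # only app1's writable layer changes

print(app1.image is app2.image)  # True: both share one unchanged image
print(app2.written)              # []: app2 is unaffected by app1's writes
```

Where the analogy stretches: a container is also an isolated running process with its own filesystem layer and network, and a class has no direct counterpart to image layers or tags.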

https://redd.it/1kvvp25
@r_devops
Atlassian Bamboo

Any devops who are still using this?

I’m 3 months into my promotion as devops engineer and have been given the keys to the bamboo kingdom.

It’s legacy and deprecated I believe. Also, with it being on premise it’s not the easiest to lab.

Interested in finding out who still uses this and how they find it?

I’m currently implementing a Snyk integration for our code.

Thanks and have a wonderful day!

https://redd.it/1kvx0mg
@r_devops
Migration from GCP to OCI instances

I have 10+ servers on GCP which I want to migrate to OCI. Some are production instances with live traffic and some are dev/testing servers. What is the best approach to migrate them along with all the data? Is it possible to transfer snapshots?
The GCP instances are running CentOS, while the OCI ones will run Oracle Linux images.
Any leads will be helpful.

https://redd.it/1kvy85p
@r_devops
Questions about the LFS258 Kubernetes Course – Worth It for CKA Prep?

Hi everyone,

I'm looking into taking the **LFS258 - Kubernetes Fundamentals** course from the Linux Foundation, and I have a few questions for those who have taken it:

* Is the course mostly pre-recorded video lectures?
* Does it include hands-on labs and troubleshooting practice?
* Is it beginner-friendly for someone with **no prior Kubernetes experience**?
* Is it enough on its own to prepare for the **CKA (Certified Kubernetes Administrator)** exam?
* Would you recommend buying **just the course**, or going for the **bundle with the exam voucher**?
* Are there any known **discount codes or promotions** for this course?
* Lastly, would you say this course is a good choice for someone coming from a **Cloud Engineering background** and looking to transition into **DevOps**?

Appreciate any insights or advice you can share – thank you!

https://redd.it/1kw1ner
@r_devops
I ruined a POC

Been in DevOps for 4.5 years. Started as a Linux administrator and now I'm managing cloud, DB and container orchestration. So my manager asked me to do a POC on Traefik, which is a reverse proxy just like nginx. I did well and explored the features, but was unable to implement the fail2ban plugin in it. When I was presenting it to my manager, I forgot basic Docker Compose syntax and now I think my role is in jeopardy. Has anyone else faced this? Motivate me please, I'm scared.

https://redd.it/1kw0o9g
@r_devops
What’s the best SSO solution for a mid-sized company (50+ employees) in 2025?

Curious to hear what the DevOps community is seeing work best today.

For companies with ~50–200 employees, minimal internal IT, and tools like GitHub, Gmail, Vault, AWS, and Graylog: what are your go-to SSO solutions?

Looking for feedback on:

Ease of integration (SAML/OIDC)
Multi-IDP support
Support for SCIM provisioning
Transparent, scalable pricing (no bloated enterprise overhead)
Good developer experience

Here’s a list I often see in conversations:

Azure AD (Entra ID)
[Keycloak](https://www.keycloak.org/)
Authentik
[WorkOS](https://workos.com/)
SSOJet

Would love to hear your experience with any of these or other favorites — especially across multi-tenant or external user auth use cases.

https://redd.it/1kw0uvh
@r_devops
Docker image works fine locally but not on GCP

Hi everyone,

I’m running a Docker image with an old Ruby version on Debian. It works locally with Docker Compose, but fails with “Service Unavailable” on GCP Cloud Run. The issue seems to be incompatibility with the latest Ubuntu version used in the infra.

I can’t upgrade Ruby due to legacy constraints—we’re rewriting it in another language. Any suggestions for getting this to run on Cloud Run as-is?

Thanks!



https://redd.it/1kw0fpi
@r_devops
Multi-stage release pipeline, how to require one approval from each of two separate groups?

Hi all, I am trying to implement a release pipeline in Azure DevOps using YAML.

I have a requirement where two groups need to manually approve a release. At least one person per group must approve. So I deploy to an environment like `staging` or `prod`, but before deployment I want a manual approval gate where at least one person from `group a` and at least one person from `group b` need to manually approve.

I want to avoid using the Classic Release UI as I want the whole process to be code-defined in yaml.

I have tried looking at the YAML definition but did not get very far. To be honest, if I could version control the groups here, that would be a really nice feature. Using ManualValidation@0 in YAML sounded interesting, but as far as I can tell anyone can approve and there is no concept of groups, so this is out of the question.

I have tried looking into `environments` with approval checks but Azure DevOps only supports assigning a single group to an environment’s approval gate. That doesn't seem to allow me to enforce the "one per group" logic.

I came across the idea of using two environments per stage, e.g. `staging-group-a` and `staging-group-b`. I was also thinking of having two representatives for the workflow and letting them defer approval if necessary. Both options sound clunky, and I think I prefer the latter.
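
To pin down the rule I'm after, here is the "at least one approver per group" check as a small Python sketch. This is not an Azure DevOps API or feature; the group members and the `approval_satisfied` name are hypothetical.

```python
def approval_satisfied(approvers, groups):
    """True when every group has at least one member among the approvers."""
    approved = set(approvers)
    # An empty intersection is falsy, so a group with no approver fails the check.
    return all(approved & set(members) for members in groups.values())


# Hypothetical group membership, for illustration only.
groups = {
    "group a": {"alice", "bob"},
    "group b": {"carol", "dave"},
}

print(approval_satisfied(["alice"], groups))         # False: nobody from group b
print(approval_satisfied(["alice", "bob"], groups))  # False: two from group a only
print(approval_satisfied(["bob", "carol"], groups))  # True: one from each group
```

Whatever mechanism ends up enforcing this has to behave like the second case: two approvals from the same group must not count as a pass.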

Is there a simple way to solve this problem? It feels more complicated than it has to be.

https://redd.it/1kw5khg
@r_devops