on going distroless before you harmonize your company's base OS/images.
Spend your days getting everyone using the same Alpine/Debian/Ubuntu/whatever image first - your challenge of moving these containers to distroless/hardened images will be 100x easier if you do.
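For concreteness, a sketch of what the end state can look like for a single hypothetical Go service (the build image, paths, and service name are placeholders; the same two-stage shape applies whatever base image you standardize on first):

```dockerfile
# Build stage: full toolchain and package manager are available here.
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Runtime stage: distroless, so no shell and no package manager.
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```

Once every team builds from the same harmonized base, flipping only the runtime stage like this is a far smaller change than untangling per-team images.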
https://redd.it/1jclbjk
@r_devops
Where should I start and what should I learn
So I'm a BTech IT student, and after trying web development and DSA, I know these aren't for me. I started learning about DevOps and gained interest in it. Please suggest some resources to learn from, what I should learn and in what order, and please suggest free resources because I have money problems.
https://redd.it/1jco1zb
@r_devops
GitHub Actions - runners giving role assignments
Hello :)
After researching best practices for assigning roles in an IaC workflow, I haven't found a clear, definitive "proper way" to do it.
Initially, I considered using a broker system with PIM and JIT for Azure, but this doesn’t seem to work with workload identities. While it’s possible to simulate this with code, it feels a bit janky.
Has anyone tested different approaches to handle this?
Essentially, I want to avoid giving a workload identity permanent role assignment capabilities. Is this "just the way it's done", or is there a better way to achieve it?
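One commonly cited baseline, though not a full PIM/JIT answer since the identity's standing role assignments remain: OIDC federation via azure/login, so the runner never holds a long-lived client secret and each run exchanges a short-lived token. Secret names here are placeholders:

```yaml
permissions:
  id-token: write   # lets the runner request an OIDC token from GitHub
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
```

This narrows the credential-lifetime problem; scoping what the federated identity can assign is still a separate RBAC design question.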
https://redd.it/1jcsz7i
@r_devops
Help with a School Project on Cloud Management
Hey everyone! 👋
If you work with AWS, Azure, or GCP, I’d love to get your insights on cloud infrastructure management! I’m running a short survey to understand how engineers and DevOps teams handle cloud optimisation, automation, and security.
The survey is completely anonymous, and I’d really appreciate your time!
👉 **Take the survey here**
Thanks in advance for your time!
https://redd.it/1jcuox1
@r_devops
k8s monitoring costs are exploding at my startup
Please let me know if this is the correct place to post.
I'm in a bit of a situation that I wonder if any of you can relate to. I'm the fractional CTO at a rapidly growing startup (100+ microservices, elasticsearch k8s), and our observability costs are absolutely DESTROYING our cloud budget.
We're currently paying close to $80K/month just for APM/logging/metrics (not even including infrastructure costs 😭).
I've been diving deep into eBPF-based monitoring solutions as a potential way out of this mess. The promise of "monitor everything with zero code instrumentation" sounds almost too good to be true.
Has anyone here successfully made the switch from traditional APM tools (Datadog/New Relic) to eBPF-based monitoring in production?
Specifically, I'm curious about:
- Real-world performance overhead on nodes
- How complete is the visibility really? (especially for things like HTTP payload inspection)
- Any gotchas with running in production?
- Actual cost savings numbers if you're willing to share
Would love to hear your war stories and insights.
https://redd.it/1jcym3x
@r_devops
Most recognized/useful certs in DevOps?
Hello, sitting at about 5 YOE as a cloud/DevOps engineer. Have a good grasp of everything in the cloud, got a bunch of AWS and Azure certs.
I've been given some professional development time at work, and they generally like us to get certificates. I was wondering if anyone could suggest a certification that is generally highly regarded in DevOps? I was leaning towards a Kubernetes or possibly Red Hat cert.
https://redd.it/1jd1fqh
@r_devops
Hot Take: Platform Engineering Is NOT the same as DevOps
I see this question so many times. I figured, wtf, why not just write a blog about it based on my experience.
Blog post link: https://ctrlplane.dev/blog/what-is-platform-engineering
You can read the full breakdown there, but here are my hot takes:
DevOps is the 'why', Platform Engineering is the 'how'. DevOps is a philosophy (at the very least it's supposed to be). Collaboration, automation, the whole shebang. But it can be kinda vague. Platform Engineering is about actually building the tools and platforms to make that happen. Think of it as putting concrete under the DevOps ideals.
It's not just renaming DevOps. Sure, there's overlap. But Platform Engineering is more focused on building standardized, scalable platforms. It's about giving devs a consistent and efficient experience.
If you're in DevOps, you're probably doing some Platform Engineering already. If you're automating infrastructure, building CI/CD pipelines, or creating self-service tools, you're on the right track.
The future is platforms. As things get more complex (microservices, cloud, etc.), Platform Engineering is gonna be even more crucial. Companies need dedicated teams to build these platforms.
Basically, DevOps is the idea, SRE is the reliability implementation, and Platform Engineering is the implementation that improves the developer experience.
Okay, this might be a bit pedantic, but I'm in software engineering and that's what we do.
https://redd.it/1jd307e
@r_devops
Streamlining Secrets Management for AWS Lambda with AWS Secrets Manager & TypeScript
Hello r/devops,
I’d like to share my latest video tutorial on securing AWS Lambda functions using AWS Secrets Manager in a TypeScript monorepo. This method centralizes secret management, improves security, and ensures cost efficiency—key aspects for modern DevOps practices.
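As an illustration of the cost-efficiency angle (fewer Secrets Manager API calls across warm invocations), here is a minimal sketch of the caching pattern with the loader injected; `makeSecretGetter` and the secret names are hypothetical and not from the video, and a real loader would wrap `GetSecretValueCommand` from `@aws-sdk/client-secrets-manager`:

```typescript
// Sketch of the caching idea (loader injected so the example runs
// without AWS): fetch each secret once per warm container, reuse after.
type SecretLoader = (name: string) => Promise<string>;

function makeSecretGetter(load: SecretLoader) {
  const cache = new Map<string, Promise<string>>();
  return (name: string): Promise<string> => {
    // Cache the promise itself so concurrent callers share one fetch.
    let pending = cache.get(name);
    if (!pending) {
      pending = load(name);
      cache.set(name, pending);
    }
    return pending;
  };
}
```

Because module scope survives between warm Lambda invocations, constructing the getter at module level means each secret is fetched at most once per container lifetime.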
Watch the video: https://youtu.be/I5wOfGrxZWc
Access the source code here: https://github.com/radzionc/radzionkit
I appreciate any thoughts or feedback you may have. Thanks for reading!
https://redd.it/1jd4cck
@r_devops
I chose docker swarm
Wanted to know your opinion on this setup i made.
So I got hired by this company that has a lot of mobile apps and websites. All backends were dockerized and put on one mega EC2 instance, each bound to a different port on the machine, with an nginx reverse proxy listening on the domain and sending traffic to the respective port on localhost.
The server's load was through the roof and they wanted to add more and more backends.
One more thing of relevance here, I'm the only devops guy there, the rest are backend developers with little knowledge in docker or frontend devs with no knowledge in docker.
The solution I proposed: Docker Swarm over multiple EC2 instances.
First, I ran nginx in Docker instead of installing it directly on the instance, one replica per instance.
Second, every internet-facing app is added to the nginx Docker network. This eliminates the need to bind it on the host; it can be reached internally from the nginx container as stackname_servicename:serviceport.
A service can have a second network if it talks to any other services.
We can almost use the same docker compose files that were used before; aside from a few new commands the devs have to learn, they can all understand the infra.
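The setup above can be sketched as a hypothetical stack file (image, stack, and network names are placeholders), where nginx reaches the service over a shared overlay network as mystack_api:3000 with no host port binding:

```yaml
# Deployed with: docker stack deploy -c stack.yml mystack
version: "3.8"
services:
  api:
    image: registry.example.com/api:latest   # hypothetical image
    networks:
      - nginx_net        # shared with the nginx stack; no `ports:` needed
    deploy:
      replicas: 2
networks:
  nginx_net:
    external: true       # created once: docker network create -d overlay nginx_net
```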
Now, I could set up an ASG in AWS, but I'd prefer to do it manually for now. I prepared a Terraform/Ansible script that provisions the leader/nodes of the swarm, so I can simply increase the number of nodes and they will be provisioned and configured into the swarm.
For DNS, I want to add every node's public IP to every domain (this bit surely needs improvement) so that traffic reaches the nginx on the node itself.
Databases are still a problem: I chose to put them all on the leader node so I would preserve the data on restarts. I chose this over EBS multi-attach or EFS.
Let me know your opinion on this and how you would improve it
https://redd.it/1jd75nc
@r_devops
The eternal struggle
Tech is easy. You have a problem, you troubleshoot, you fix it. Rinse and repeat. But explaining that problem to someone who isn’t knee-deep in logs and YAML files? That’s where I crash and burn.
I've been working in DevOps for a while now, and the more I progress technically, the more I realize that my soft skills are lagging hard. Talking to stakeholders, justifying decisions, even something as basic as daily stand-ups. Half the time, I feel like I'm either over-explaining or not making sense at all. It's like my brain refuses to translate tech into human language.
And it’s not just a work thing. The same awkwardness bleeds into my personal life. Making conversation? small talk? networking? It feels like an impossible task. Meanwhile I see colleagues who just get people. They navigate meetings like it’s a dance, while I’m out here stepping on toes and knocking over chairs.
I know soft skills are a muscle that needs training, but imo it requires actual effort and consistency, and I’d rather refactor a spaghetti-code terraform module than actively work on my communication skills.
https://redd.it/1jd95s6
@r_devops
Roast My SaaS Monorepo Refactor (DDD + Nx) - Where Do Migrations & Databases Go?
Hey r/devops, roast my attempt at refactoring my SaaS monorepo! I’m knee-deep in an Nx setup with a Telegram bot (and future web app/API), trying to apply DDD and clean architecture. My old aws_services.py was a dumpster fire of mixed logic lol.
I am seeking some advice,
Context: I run an image-editing SaaS (~$5K MRR, 30% monthly growth) I built post-uni with no formal AWS/devops training. It's a Telegram bot for marketing agencies, using AI to process uploads. Currently at 100-150 daily users, hosted on AWS (EC2, DynamoDB, S3, Lambda). I'm refactoring to add an affiliate system and prep for a PostgreSQL switch, but my setup's a mess.
Technical Setup:
Nx Monorepo:
/apps/telegram-bot: Bot logic, still has a bloated aws_services.py.
/apps/infra: AWS CDK for DynamoDB/S3 CloudFormation.
/libs/core/domain: User, Affiliate models, services, abstract repos.
/libs/infrastructure: DynamoDB repos, S3 storage.
Database: Single DynamoDB (UserTable, planning Affiliates).
Goal: Decouple domain logic, add affiliates (clicks/revenue), abstract DB for future Postgres.
Problems:
Migrations feel weird in /apps. DB is for the business, not just the bot.
One DB or many? I’ve got a Telegram bot now, but a web app, API, and second bot are coming.
Questions:
1. Migrations in a Monorepo: Sticking them in /libs/infrastructure/migrations (e.g., DynamoDB scripts)—good spot, or should they go in /apps/infra with CDK?
2. Database Strategy: One central DB (DynamoDB) for all apps now, hybrid (central + app-specific) later. When do you split, and how do you sync data?
3. DDD + Nx: How do you balance app-centric /apps with domain-centric DDD? Feels clunky.
Specific Points of Interest:
Migrations: Centralize them or tie to infra deployment? Tools for DynamoDB → Postgres?
DB Scalability: Stick with one DB or go per-app as I grow? (e.g., Telegram's telegram_user_id vs. web app's email).
Best Practices: Tips for a DDD monorepo with multiple apps?
Roast away lol. What am I screwing up? How do I make this indestructible as I move from alpha to beta? DM me if you’re keen to collab. My 0-1 and sales skills are solid, but 1-100 robustness is my weak spot. Thanks for any wisdom!
https://redd.it/1jd94kp
@r_devops
Monitoring Terraform flow
What's the correct way to monitor Terraform flow with an S3 bucket as the backend on a big DevOps team?
Is there an option to have easily human readable output?
Or is the best way just to use something like Atlantis and abstain from running Terraform from local machines?
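Whatever tooling goes on top, a common baseline for a large team sharing one S3 backend is versioned state plus DynamoDB locking, so concurrent applies are blocked and state history is auditable (bucket and table names below are hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"   # hypothetical, versioning enabled
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"        # prevents concurrent applies
    encrypt        = true
  }
}
```

Atlantis then adds the human-readable part: plan output posted on the pull request, with applies gated on review instead of run from laptops.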
https://redd.it/1jdbchn
@r_devops
Let's talk about remediating cloud security issues
Let’s say an issue pops up in a cloud security tool (Wiz, Orca, Prisma Cloud, etc.).
I’m trying to understand what happens after an alert is prioritized and added to Jira for remediation by DevOps/DevSecOps.
What takes the most time in the remediation process? I assume it depends a lot on the alert type. I’d imagine that figuring out the impact of a change on existing infrastructure and applications takes a while—but does it? Is there anything else that slows things down?
Also, do the "simple" alerts—like closing an S3 bucket, restricting an IAM role, or changing a policy—still take time to remediate?
Thanks!
Disclaimer - I am now building a security startup and I want to understand this problem better.
https://redd.it/1jdboxq
@r_devops
How toil killed my team
When I first stepped into the world of Site Reliability Engineering, I was introduced to the concept of toil. Google’s SRE handbook defines toil as anything repetitive, manual, automatable, reactive, and scaling with service growth—but in reality, it’s much worse than that. Toil isn’t just a few annoying maintenance tickets in Jira; it’s a tax on innovation. It’s the silent killer that keeps engineers stuck in maintenance mode instead of building meaningful solutions.
I saw this firsthand when I joined a new team plagued by recurring Jira tickets from a failing dnsmasq service on their autoscaling GitLab runner VMs. The alarms never stopped. At first, I was horrified when the proposed fix was simply restarting the daemon and marking the ticket as resolved. The team had been so worn down by years of toil and firefighting that they'd rather SSH into a VM and run a command than investigate the root cause. They weren't lazy—they were fatigued.
This kind of toil doesn't happen overnight. It's the result of years of short-term fixes that snowball into long-term operational debt. When firefighting becomes the norm, attrition spikes, and innovation dies. The team stops improving things because they're too busy keeping the lights on. Toil is self-inflicted, but the first step to recovery is recognizing it exists and having the will to automate your way out of it.
https://redd.it/1jdd63a
@r_devops
DDoS, what's your story? How much? Who? What do you do against it? Any horror stories to share?
I'm curious to hear about your DevOps experience regarding DDoS attacks.
How often do you encounter DDoS attacks, and what type of DDoS are they (L7, for example)?
Have you noticed specific patterns or events that trigger these attacks?
What tools do you use to defend against them?
Do you have any horror stories to share?
https://redd.it/1jddc3f
@r_devops
Grafana Alloy: My Promtail Migration Journey (with HCL configs ready to steal)
Hey fellow DevOps warriors,
After putting it off for months (fear of change is real!), I finally bit the bullet and migrated from Promtail to Grafana Alloy for our production logging stack.
Thought I'd share what I learned in case anyone else is on the fence.
Highlights:
- Complete HCL configs you can copy/paste (tested in prod)
- How to collect Linux journal logs alongside K8s logs
- Trick to capture K8s cluster events as logs
- Setting up VictoriaLogs as the backend instead of Loki
- Bonus: Using Alloy for OpenTelemetry tracing to reduce agent bloat
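For flavor, a minimal Alloy pipeline in the spirit of the highlights above; the VictoriaLogs push URL and component labels are assumptions on my part, the post has the configs actually tested in prod:

```alloy
// Ship the host's systemd journal to a Loki-compatible endpoint.
loki.source.journal "system" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    // VictoriaLogs exposes a Loki-compatible push API (URL assumed).
    url = "http://victorialogs:9428/insert/loki/api/v1/push"
  }
}
```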
Nothing groundbreaking here, but hopefully saves someone a few hours of config debugging.
The Alloy UI diagnostics alone made the switch worthwhile for troubleshooting pipeline issues.
Full write-up:
https://developer-friendly.blog/blog/2025/03/17/migration-from-promtail-to-alloy-the-what-the-why-and-the-how/
Not affiliated with Grafana in any way - just sharing my experience.
Curious if others have made the jump yet?
https://redd.it/1jdhqnk
@r_devops
How many of you fellow devopses actually do meaningful work?
I'm not talking about "some" work, but actually meaningful work like:
- migrating big important workloads
- solving high scaling issues
- setting up stuff from ground up (tenants for clients that pay a lot)
- managing fleets of k8s clusters
---
Recently I joined a team that supports some e-commerce platform, but the majority of the work is small fixes here and there. The pay is good and I have a lot of free time, but I'm wondering how many people are doing barely anything like me, and how many are doing the heavy lifting.
https://redd.it/1jdiygl
@r_devops
Advice on CI/CD setup with GitHub Actions
I'll try to keep this short. We use GitHub as our code repository, so I decided to use GitHub Actions for CI/CD pipelines. I don't have much experience with all the DevOps stuff, but I am currently trying to learn it.
We have multiple services, each in its own repository (this is pretty new; we had a monorepo before, so the following problem didn't exist until now). All of these repos have at least 3 branches: dev, staging and production. Now, I need the following: whenever I push to staging or production, I want it to redeploy to AWS on Kubernetes (with Kustomize for segregating the environments).
My intuitive approach was to make a new "infra" repository where I can centrally manage my deployment workflow, which basically consists of these steps: setting up AWS credentials, building images and pushing them to the AWS registry (ECR), and applying the K8s Kustomize config, which picks up the new image and redeploys accordingly.
I initially thought introducing the infra repo to separate the concerns (business logic vs. infra code) and make the infra stuff more reusable would be a great idea, but I quickly realized that this comes with some issues: the image build process has to take place in the service repo, because it has to access the Dockerfile. However, the infra process has to take place in the infra repo, because that is where I have all my K8s files. Ultimately this leads to a contradiction: I found out that if I call the infra workflow from the service repository, it is executed in the context of the service repo, so I don't have access to the K8s files in the infra repo.
My conclusion is that I would have to do the image build and push in the service repo. Consequently, the infra repo must listen for this and somehow get triggered to do the redeployments. Or should I just check out the other repo in the workflow?
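One common pattern for the cross-repo trigger described above is a `repository_dispatch` event: the service repo builds and pushes the image, then POSTs to the infra repo's dispatches endpoint via the GitHub REST API. A rough sketch, where the org/repo names, secret name, and payload fields are made up for illustration:

```yaml
# .github/workflows/build-and-notify.yml (in the service repo)
name: Build, push, and notify infra
on:
  push:
    branches: [staging, production]

jobs:
  build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image to ECR
        run: |
          # aws ecr get-login-password ... | docker login ...
          # docker build / tag / push with ${{ github.sha }} as the tag
          echo "build and push here"
      - name: Trigger infra repo deployment
        run: |
          curl -sS -X POST \
            -H "Authorization: Bearer ${{ secrets.INFRA_REPO_TOKEN }}" \
            -H "Accept: application/vnd.github+json" \
            https://api.github.com/repos/my-org/infra/dispatches \
            -d '{"event_type": "deploy",
                 "client_payload": {"service": "my-service",
                                    "env": "${{ github.ref_name }}",
                                    "image_tag": "${{ github.sha }}"}}'
```

The infra repo then declares `on: repository_dispatch: types: [deploy]` in its own workflow and reads `github.event.client_payload` to know which service, environment, and tag to redeploy. Note that a reusable workflow called with `uses:` always runs in the caller's context, which is exactly the limitation described above; `repository_dispatch` (or explicitly checking out the infra repo with a token) sidesteps it.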
Sorry if anything is unclear; as I said, I am pretty new to DevOps. I'd appreciate any input from you guys. It's important to me to follow best practices, so don't be gentle with me.
Edit: typos
https://redd.it/1jdksxo
@r_devops
I did an analysis of the DevOps job market for 2025
Hi Folks,
At the beginning of 2024 I did a pet project and scraped around 700 LinkedIn DevOps job posts. I still had the data and wanted to do something with it, so yesterday I compared it to March 2025.
Here are the findings: coding is required much more than it used to be. Golang went up 13%, Python went up 9%, and so did JS.
Hate to say it, but Jenkins went up too. I don't know why, but my guess is that fewer people work with it and there is a shortage.
There are other things too, like certificates being required or mentioned far less (by a lot).
Anyway, here is the article: https://prepare.sh/articles/devops-job-market-trends-2025
I advise you to check it out, but in case you want the very minimal version:
TL;DR
Go +13%
Python +9%
Jenkins +6.8% (almost 7%)
Terraform +9%
Flux down, Argo up (slightly)
Certs are mentioned 15-20% less than they used to be. Everyone seems to have one, and they've become saturated.
https://redd.it/1jdo4zd
@r_devops
How's the MacBook Air M4 for a software engineer?
I'm thinking about getting the MacBook Air M4 for my everyday engineering tasks. I don’t do anything too intense—just running web apps, scripts, and a few Docker containers on my local machine. It’s mostly standard DevOps stuff. My work leans more toward DevOps and cloud computing, and I usually run the heavier applications on a remote server.
For those with a MacBook Air, do you think it’s a good fit for my typical workload?
https://redd.it/1jdvppg
@r_devops
Best DevOps tutorials that are equivalent, or almost equivalent, to actual work experience
In my experience, practical tutorials are the best thing to become ready to take on any job, so I am wondering what are the best practical tutorials for devops.
https://redd.it/1jdvmez
@r_devops