Built a tool to stop wasting hours debugging Kubernetes config issues
Spent way too many late nights debugging "mysterious" K8s issues that turned out to be:
- Typos in resource references
- Missing ConfigMaps/Secrets
- Broken service selectors
- Security misconfigurations
- Docker images that don't exist or have wrong architecture
Built Kogaro to catch these before they cause incidents. It's like a linter for your running cluster.
Key insight: Most validation tools focus on policy compliance. Kogaro focuses on operational reality - what actually breaks in production.
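For example, a "broken service selector" is just a Service whose selector matches no Pod's labels. The core of such a check is tiny; here is a sketch using simplified dict shapes (not the real K8s API objects) just to show the idea:

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """A Service selects a Pod only if every selector label matches exactly."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

def dangling_services(services, pods):
    """Return names of services whose selector matches no pod at all."""
    return [
        svc["name"]
        for svc in services
        if svc["selector"]
        and not any(selector_matches(svc["selector"], p["labels"]) for p in pods)
    ]
```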
Features:
- 60+ validation types for common failure patterns
- Docker image validation (registry existence, architecture compatibility, version)
- Structured error codes (KOGARO-XXX-YYY) for automated handling
- Prometheus metrics for monitoring trends
- Production-ready (HA, leader election, etc.)
Takes 5 minutes to deploy, immediately starts catching issues.
Latest release v0.4.2: https://github.com/topiaruss/kogaro
Demo: https://kogaro.dev
What's your most annoying "silent failure" pattern in K8s?
https://redd.it/1l8qwyq
@r_devops
Anyone else learning Python just to stop copy-pasting random shell commands?
When I started working with cloud stuff, I kept running into long shell commands and YAML configs I didn’t fully understand.
At some point I realized: if I learned Python properly, I could actually automate half of it… and understand what I was doing instead of blindly copy-pasting scripts from Stack Overflow.
So I’ve been focusing more on Python scripting for small cloud tasks:
→ launching test servers
→ formatting JSON from AWS CLI
→ even writing little cleanup bots for unused resources
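For instance, the "formatting JSON from AWS CLI" task can be a ten-line helper; this is just a sketch, and the `pretty_aws` wrapper name and command are made up:

```python
import json
import subprocess

def format_json(raw: str) -> str:
    """Pretty-print a JSON string with stable key order."""
    return json.dumps(json.loads(raw), indent=2, sort_keys=True)

def pretty_aws(*args: str) -> None:
    """Run an AWS CLI command and pretty-print its JSON output."""
    raw = subprocess.run(
        ["aws", *args, "--output", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(format_json(raw))

# e.g. pretty_aws("ec2", "describe-instances")
```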
Still super early in the journey, but honestly, using Python this way feels way more rewarding than just “finishing tutorials.”
Anyone else taking this path — learning Python because of cloud/infra work?
Curious how you’re applying it in real projects.
https://redd.it/1l8uhvk
@r_devops
8 YOE all at the same company: is my resume senior-worthy at a tech company?
Hey all,
I’ve been working full-time for over 8 years at the same Fortune 500 non-tech company (and interned at a different one prior to that), but I’m finally ready to look elsewhere because I feel underpaid relative to the value I can create. Here’s my anonymized resume:
https://imgur.com/a/nd3T1MA
I’ve been in 4 different organizations within the company, but I can’t tell whether I am actually going to get looks at FAANG-adjacent companies or if I’m wasting my time by going through the application process. The bar is so low to meet expectations at my current company that I worry it’s made me soft/lazy/unattractive to more prestigious employers. I don’t want to get into a senior or staff interview and make an ass out of myself. What are your thoughts?
Thank you!
https://redd.it/1l8yyie
@r_devops
Change Log Creation
I added a step to my build process that generates a changelog from the commit messages, grouped by date, since the last tag. Now I'm facing an interesting decision and want some suggestions. Option 1: call the changelog build task when I generate the release (on GitHub) and only make it part of the release. Option 2: generate the changelog on build and commit it back to the repository as part of the build process. I'm not thrilled with either option. I want to make this as easy as possible, but it feels dirty to commit as part of the build. I could also do this as a pre-commit hook; not sure if that's better, but it would require some setup on the dev machine. What are you folks doing in a similar scenario? This is part of a generic build agent/pipeline; I think I posted it on here already.
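For what it's worth, the generation step itself stays tiny either way. A sketch of the "commits since the last tag, grouped by date" part, assuming standard `git describe` / `git log` usage:

```python
import subprocess
from collections import defaultdict

def git_log_since_last_tag() -> list[tuple[str, str]]:
    """Return (date, subject) pairs for commits after the most recent tag."""
    last_tag = subprocess.run(
        ["git", "describe", "--tags", "--abbrev=0"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    out = subprocess.run(
        ["git", "log", f"{last_tag}..HEAD", "--pretty=%ad|%s", "--date=short"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(line.split("|", 1)) for line in out.splitlines() if line]

def render_changelog(entries) -> str:
    """Group commit subjects by date into a Markdown changelog, newest first."""
    by_date = defaultdict(list)
    for date, subject in entries:
        by_date[date].append(subject)
    lines = ["# Changelog", ""]
    for date in sorted(by_date, reverse=True):
        lines.append(f"## {date}")
        lines += [f"- {s}" for s in by_date[date]]
        lines.append("")
    return "\n".join(lines)
```

Since the output is deterministic given the repo state, committing it back or attaching it to the release are equally easy to wire up.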
https://redd.it/1l8z1q2
@r_devops
Airflow: how to reload webserver_config.py without restarting the webserver?
I tried making edits to the config file but that doesn’t get picked up. Using airflow 2. Surely there must be a way to reload without restarting the pod?
https://redd.it/1l8yl06
@r_devops
Cloud DevOps mentorship/tutoring needed
Background
I am an MSc IT Security student in Germany and a BTech Computer Science graduate from India, with multiple internships in full-stack web dev. I have completed some courses on Docker and the AWS Cloud Practitioner.
Expectations
I will complete the first year of my MSc in 3 more months, after which I need to land a job with a company to do my master's thesis alongside them. I want to do it specifically at the intersection of cloud DevOps and security.
Requirement
I am looking for an experienced cloud DevOps engineer (at least 1 year) who can get me interview-ready for such roles. I only have 3 months to land a job, so the duration of the contract will also be 3 months. I specifically want to learn in depth about Kubernetes, observability, and infrastructure as code (Terraform).
Bonus
If someone can also teach me the security aspects of cloud DevOps and suggest a potential master's thesis in this field, that would be very beneficial for me.
Pay: up to 12 euro per hour
https://redd.it/1l93ej2
@r_devops
What do you use to automate self-healing scripts?
Hey everyone! Just asking this to see if I'm missing something or if the hereditary blindness already got me.
The thing is, I've been a DevOps engineer for about 5–6 years in two different companies, and in both of them, my main task was creating auto-remediation/self-healing scripts that run automatically when a monitoring tool detects something, like a spike in CPU, swap usage, low disk space, and so on.
For that whole pipeline, I've been using a mix of Python/Go/Shell (sensible scripts), orchestrated by Rundeck/Jenkins/n8n/Tower as the executors, and Grafana/Datadog or similar tools for monitoring.
So my question is: is there anything dedicated to this? I mean, a tool that, when a monitoring metric hits a threshold, can automatically trigger something on a machine or group of machines?
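The glue layer I keep rebuilding is basically this: a webhook receiver that maps alert names to remediation commands. A minimal sketch, assuming an Alertmanager-style webhook payload; the alert names and commands here are made-up examples, not production-safe remediations:

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Map alert names to remediation commands (illustrative examples only).
REMEDIATIONS = {
    "DiskSpaceLow": ["bash", "-c", "find /var/log -name '*.gz' -delete"],
    "SwapHigh": ["swapoff", "-a"],
}

def dispatch(alert_name: str):
    """Return the remediation command for an alert, or None if unknown."""
    return REMEDIATIONS.get(alert_name)

class AlertHandler(BaseHTTPRequestHandler):
    """Minimal Alertmanager-style webhook receiver."""
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        for alert in body.get("alerts", []):
            cmd = dispatch(alert.get("labels", {}).get("alertname", ""))
            if cmd:
                subprocess.run(cmd, check=False)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 9009), AlertHandler).serve_forever()
```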
https://redd.it/1l956jb
@r_devops
Developer cheat sheet
I created this free cheat sheet for cli commands.
I tend to prefer to invoke commands in my IDE vs GUI.
This is free.
If there is anything you want me to add please let me know.
https://devcheatsheet.io
https://redd.it/1l95236
@r_devops
Automate adding vCluster to Argo CD using External Secrets Operator - GitOps
A blog post about how to automate provisioning virtual clusters (vCluster) using External Secrets Operator. Basically, when a vCluster is created, it will be added automatically to Argo CD using External Secrets PushSecret and ClusterSecretStore.
Automate adding vCluster to Argo CD using External Secrets Operator
Enjoy :-)
https://redd.it/1l973i6
@r_devops
Best way to structure a new Azure DevOps pipeline for Playwright tests?
Hi everyone, I could use some help structuring a test pipeline in Azure DevOps using Playwright. My team used to work with Cypress, but we’re currently migrating to Playwright. The thing is, we never had a dedicated pipeline for automated tests, only build and deploy pipelines for the dev team, which were recently moved to another Azure DevOps project.
Now we want to create a separate pipeline specifically for testing, and I’m unsure of the best approach: should I create a brand-new YAML file just for the Playwright tests? Or try to reuse the old pipeline structure (even though it’s from another project and wasn’t built for testing in the first place)?
I’m looking for advice on what would be the best practice here, especially in terms of long-term organization and maintainability. If anyone has been through a similar migration, I’d really appreciate your insights. Thanks!
*E2E tests
https://redd.it/1l984wd
@r_devops
Secure s3 dashboard/website
Hi everyone. I am losing my mind over what seems to be a simple problem.
So basically, I created internal dashboard (website stored in private s3). I have internal route53 record to use with it if needed, and internal ALB.
What I can't figure out is how to restrict access to it to only users behind the VPN. I tried CloudFront, but the problem is that the VPN uses split tunnel and the public IP doesn't change, so WAF, Lambdas, etc. do not work.
What are my options to control access to this dashboard for selected users (preferably ones behind the VPN, without extra login layers)?
https://redd.it/1l9cd6d
@r_devops
Ode to the sysAdmin
Did the world forget that Systems Administrators existed before hierarchical power structures?
- Customer support
- Engineer
- Architect
The architect’s role is to understand the shape of the bridge the customer needs, and the engineer builds the bridge.
If an Architect is expected to play Engineer, asked to build the bridge, whilst others were sabotaging the structure, who’s at fault?
The Architect?
The Engineer?
The 400 other people between,
Or the customer, which isn’t one, but many.
Please, think about that for a second.
A Domain Admin can never be asked to unsee what’s been seen.
We make sure others hold the same responsibility with the same honor, hoping that somewhere along the chain takes up enough of the slack to keep it together.
Systems Engineering isn’t easy.
Complex-Systems Architecture isn’t hard.
Meet me in the middle; or help me build the bridge.
https://redd.it/1l9f98n
@r_devops
Need a config management solution for structured per-item folders
I’m building a Python service that monitors various IoT devices (e.g., industrial motors, cold storage units).
Each monitored device has its own folder with all of its configuration inside:
A `.config` file with runtime parameters
A `schema.json` file describing the expected sensor input
A `description.txt` file that explains what this device does and how it's monitored
Here is the simplified folder structure:
`project/`
`├── main.py`
`├── loader.py`
`├── devices/`
`│   ├── fridge_a/`
`│   │   ├── config.config`
`│   │   ├── schema.json`
`│   │   └── description.txt`
`│   ├── motor_5/`
`│   │   ├── config.config`
`│   │   ├── schema.json`
`│   │   └── description.txt`
`│   └── ...`
What I’m Looking For:
A web interface to create/edit/delete these device folders
Ability to store and manage `.config`, `schema.json`, and `description.txt`
A backend (self-hosted or cloud) my Python service can query to fetch this config at runtime
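Until a dedicated tool turns up, the runtime-fetch side is small enough to sketch. A loader for the layout above, reading from the local filesystem; a real backend would sit behind an HTTP API instead:

```python
import json
from pathlib import Path

def load_devices(root: str = "devices") -> dict:
    """Load each device folder's config, parsed schema, and description."""
    devices = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        devices[folder.name] = {
            "config": (folder / "config.config").read_text(),
            "schema": json.loads((folder / "schema.json").read_text()),
            "description": (folder / "description.txt").read_text(),
        }
    return devices
```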
https://redd.it/1l9hr4a
@r_devops
How to trigger AWS CodeBuild only once after multiple S3 uploads (instead of per file)?
I'm trying to achieve the same functionality as discussed in this AWS Re:Post thread:
https://repost.aws/questions/QUgL-q5oT2TFOlY6tJJr4nSQ/multiple-uploads-to-s3-trigger-the-lambda-multiple-times
However, the article referenced in that thread either no longer works or doesn't provide enough detail to implement a working solution. Does anyone know of a good article, AWS blog, or official documentation that explains how to handle this scenario properly?
P.S. Here's my exact use case:
I'm working on a project where an AWS CodeBuild project scans files in an S3 bucket using ClamAV. If an infected file is detected, it's removed from the source bucket and moved to a quarantine bucket.
The problem I'm facing is this:
When multiple files (say, 10 files) are uploaded at once to the S3 bucket, I don’t want to trigger the scanning process (via CodeBuild) 10 separate times—just once when all the files are fully uploaded.
As far as I understand, S3 does not directly trigger CodeBuild. So the plan is:
S3 triggers a Lambda function (possibly via SQS),
Lambda then triggers the CodeBuild project after determining that all required files are uploaded.
But I’d love suggestions or working patterns that others have implemented successfully in production for similar "batch upload detection" problems.
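For the S3 → SQS → Lambda leg, the useful knob is SQS's batching window: many uploads arrive as one Lambda invocation, so you start one build per batch instead of one per file. A rough sketch; the project name and env var are placeholders, and there is still a race if uploads straddle two batches, so a delay queue or a "manifest uploaded last" marker is safer:

```python
import json

def distinct_buckets(records) -> set:
    """Collapse a batch of S3 event records into the distinct buckets touched."""
    return {r["s3"]["bucket"]["name"] for r in records if "s3" in r}

def handler(event, context):
    """SQS-triggered Lambda: S3 events -> SQS (with batching window) -> here.

    Starts a single CodeBuild scan per bucket for the whole batch.
    """
    import boto3  # imported lazily so the module loads without the SDK

    records = []
    for msg in event.get("Records", []):         # SQS messages
        body = json.loads(msg["body"])           # each wraps an S3 event
        records.extend(body.get("Records", []))  # individual S3 object records

    codebuild = boto3.client("codebuild")
    for bucket in distinct_buckets(records):
        codebuild.start_build(
            projectName="clamav-scan",  # example project name
            environmentVariablesOverride=[
                {"name": "BUCKET", "value": bucket, "type": "PLAINTEXT"}
            ],
        )
```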
https://redd.it/1l9j2gg
@r_devops
Projects for resume
Hi folks.
I have 2 YOE in IT and I want to move into DevOps. I have theory and a little hands-on experience with DevOps tools like Jenkins, Ansible, Docker, and K8s. I have also taken some sample code from ChatGPT, built Docker images for it using Jenkins, and applied K8s deployments to them.
So now I wanted to know whether I can add these to my resume as projects.
Also, if I want to contribute to open source, how do I search for suitable projects?
Would also love to hear about some other project ideas.
https://redd.it/1l9ke1a
@r_devops
Anyone switch from Python to Golang for most of their day-to-day tasks?
I'm in a situation where there are a lot of teams that each use different Linux distributions, and dealing with Python dependencies, venvs, etc. is becoming a royal PITA.
https://redd.it/1l9lqdm
@r_devops
I'm in a situation where there's a lot of teams that each use different Linux distributions and dealing with Python dependencies, venvs, etc... is becoming a royal PITA.
https://redd.it/1l9lqdm
@r_devops
How can I create a clear SBOM output for my applications?
I am new to this community and currently looking for a way to create an SBOM on my Windows systems and then scan it for security vulnerabilities. My goal is to get a consolidated block per application in the terminal: not one line per CVE, but all the information (similar to a winget view) grouped together per application. That way, you can quickly see which application needs to be updated instead of having to search around. Additionally, this should also be displayed as a list in the terminal.
So far I have tried syft + grype
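If syft + grype nearly work, the grouping can be done on grype's JSON output. A sketch assuming grype's `matches[].artifact` / `matches[].vulnerability` layout (pipe `grype <target> -o json` into it); field names should be checked against your grype version:

```python
from collections import defaultdict

def group_by_package(grype_json: dict) -> dict:
    """Group CVE ids per package name/version from `grype -o json` output."""
    grouped = defaultdict(list)
    for m in grype_json.get("matches", []):
        pkg = f'{m["artifact"]["name"]} {m["artifact"]["version"]}'
        vuln = m["vulnerability"]
        grouped[pkg].append(f'{vuln["id"]} ({vuln.get("severity", "?")})')
    return dict(grouped)

def print_report(grouped: dict) -> None:
    """Print one consolidated block per application, winget-style."""
    for pkg in sorted(grouped):
        print(pkg)
        for v in grouped[pkg]:
            print(f"  - {v}")
        print()
```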
Maybe someone can help me here, thanks in advance :)
https://redd.it/1l9nrvq
@r_devops
Opsgenie shutting down, looking for replacement. Suggestions?
Opsgenie will be ending its service in 2027. We want to find a good replacement soon so we have enough time to choose carefully and not rush last minute. Does anyone have recommendations for other tools we should consider?
Here's what we mainly use Opsgenie for:
* Checking who is on call and directing calls from our VOIP system to the right person, using a webhook from our VOIP provider. We’d prefer a tool that has built-in on-call scheduling and works well with 3CX. If it doesn’t support 3CX, options like Twilio or other providers are okay.
* Sending alerts to people when they are on call.
* Notifying team members if a service goes down, based on alerts from tools like Pingdom or other monitoring services.
* Creating and managing work schedules.
* Temporarily changing schedules (for example, if someone is taking time off or is sick).
So far, I’ve checked out Incident.io, Pagertree.com, and Firehydrant (which is way too costly). Do you have any other suggestions we should look into? Right now, our team is small—just four people handling on-call duties and standby SLA —but we might grow in the future.
https://redd.it/1l9o0e6
@r_devops
Just spent 2 hours looking for feature specs that were 'somewhere'... again
Been working on the same web service for 3 years. Today I needed to update a feature and literally spent 2 hours searching for the latest API documentation. Went through Google Drive, Notion, GitHub, Slack threads, old emails...
Finally found it in a spreadsheet linked in a 6-month-old Slack message. The "official" documentation in Notion was created 3 years ago when the feature was first built and hasn't been updated since - none of the recent changes were documented.
Anyone else dealing with this documentation chaos, where teams use different tools and nobody knows who has what information? Documents get created and then abandoned, and no one can tell what's current anymore. How do you find the right information in situations like this:
Dev team uses GitHub and Notion
PMs use spreadsheets and Google Docs
Customer support uses spreadsheets and Google Docs
Design team uses Figma comments
https://redd.it/1l9mjdl
@r_devops
Is CPU utilisation the only thing that matters when it comes to performance?
I work with a lot of dev teams, and we keep getting told to scale up when CPU utilisation (or some other hardware metric) is approaching 100%.
I can't help thinking back to when I used to game a lot: better hardware meant higher performance in terms of FPS, and older hardware could sit well below 100% utilisation and still deliver low FPS.
I can't understand why they don't focus on end-result metrics rather than hardware metrics.
Or did I get all of this wrong? I don't deal with app teams directly, so I have no idea about their apps; I just deploy them and maintain the infra around them.
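The gamer analogy above can be sketched in a few lines. This is a hypothetical scaling rule, not any real autoscaler's logic; the function name, thresholds, and SLO value are all made up for illustration. The point is that the decision keys off an end-result metric (p95 latency against an SLO) rather than raw CPU.

```python
# Hypothetical sketch: decide scaling from an end-result metric (p95 latency
# vs. an SLO) instead of hardware utilisation. Names and numbers are invented.

def should_scale_up(cpu_percent: float, p95_latency_ms: float,
                    latency_slo_ms: float = 200.0) -> bool:
    """Scale when users actually feel slowness, not just when CPU is hot."""
    # cpu_percent is deliberately ignored here: CPU near 100% with latency
    # inside the SLO may just mean the hardware is being used efficiently.
    return p95_latency_ms > latency_slo_ms

# A box at 95% CPU but answering fast doesn't need more replicas...
print(should_scale_up(cpu_percent=95.0, p95_latency_ms=80.0))   # False
# ...while a box at 60% CPU with slow responses does (the low-FPS case).
print(should_scale_up(cpu_percent=60.0, p95_latency_ms=450.0))  # True
```

In practice this corresponds to driving a Kubernetes HorizontalPodAutoscaler from a custom latency metric instead of (or alongside) the default CPU target.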
https://redd.it/1l9qedh
@r_devops
Stop the madness: DevOps trends that are ruining teams in 2025
Okay I need to vent. Been doing DevOps for 10 years and I'm losing my mind watching teams chase every shiny new trend.
Just consulted with a startup that has TWELVE microservices for a todo app. Twelve! They have more services than active users. Their deployment process is longer than my morning commute and fails about as often.
And don't get me started on the team that spent half a year setting up Kubernetes to run 3 PHP apps that get maybe 100 requests per day. The operational overhead costs more than just running the damn things on a single EC2 instance.
But the thing that broke me? Production database running out of space, one-line config fix needed, but had to wait 45 minutes for the GitOps workflow. Database died after 20 minutes.
Sometimes you just need to SSH into the server and change a value. I said it. Fight me.
Hot take: most of the "successful" teams I work with are actually pretty boring. They pick proven tech, keep architectures simple, and spend time building features instead of rebuilding their infrastructure every quarter.
Anyway, wrote a whole rant about this stuff: https://medium.com/@heinancabouly/devops-trends-that-need-to-die-in-2025-please-for-the-love-of-all-that-is-holy-22cbbadf2db3?source=friends_link&sk=3f2bbe0844a62291eefd787da978ef53
Anyone else tired of this madness or is it just me getting old?
https://redd.it/1l9t7mb
@r_devops