Should we use Grafana open source at a medium-sized company?
I work at a medium-sized company using New Relic for observability. We ingest over 80GB of data monthly, run 20+ services across production and staging, and use MongoDB. While New Relic covers logs, metrics, traces and MongoDB well, it’s getting too expensive.
We’re considering switching to Grafana, Prometheus, and OpenTelemetry to handle all our monitoring needs, including MongoDB. But setting up Grafana has been a lot of manual work. There aren’t many good, maintained open-source dashboards—especially for MongoDB—and building them from scratch takes time.
I also read that as data and dashboards grow, Grafana can slow down and require more powerful machines, which adds cost and complexity. That makes us question if it’s worth switching. For a medium-sized company, is moving to open source really viable, or are the long-term setup and maintenance costs just as high?
Is anyone running Grafana OSS at scale? Does it handle large volumes well in practice?
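(One mitigation for the dashboard toil: Grafana can provision dashboards from JSON files on disk, so they live in git instead of being rebuilt by hand in the UI. A minimal provider config sketch, with illustrative paths:)

```yaml
# /etc/grafana/provisioning/dashboards/default.yaml (path illustrative)
apiVersion: 1
providers:
  - name: "team-dashboards"
    type: file
    allowUiUpdates: false        # keep git as the source of truth
    options:
      path: /var/lib/grafana/dashboards   # directory of dashboard JSON files
```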
https://redd.it/1kcz9e5
@r_devops
Is OpenTelemetry ready to monitor my (and your) infra today?
OpenTelemetry has come a long way in distributed tracing, and it also provides a high degree of correlation between logs, traces, and metrics. But OTel as a project keeps growing, and it is far more powerful today than distributed tracing alone.
Awareness of OTel for infra monitoring is still low. Folks mostly use Prometheus, which is great, but if you are using OTel for traces, logs, etc., maybe you should give it a shot for infra monitoring as well.
That said, OTel for infra is still expanding, with new receivers etc. being added.
To spread awareness, and to help anyone looking to shift from Prometheus or already using OTel and trying to reduce silos, I wrote a blog that broadly discusses:
1/ how you can use OTel for monitoring your VMs, K8s clusters and pods easily
2/ if OTel is ready to monitor your infra
3/ how to switch to OTel from Prometheus [pretty easy with the prometheus receiver]
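To illustrate point 3: the Collector's prometheus receiver accepts existing scrape configs almost verbatim, so the switch can start as a lift-and-shift. A minimal Collector sketch (the targets and OTLP endpoint are placeholders, not real infrastructure):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:          # reused from an existing prometheus.yml
        - job_name: "node"
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:9100"]   # placeholder target
  hostmetrics:                 # native OTel host monitoring
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:

exporters:
  otlp:
    endpoint: "my-backend:4317"   # placeholder backend

service:
  pipelines:
    metrics:
      receivers: [prometheus, hostmetrics]
      exporters: [otlp]
```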
Link to the blog here
https://redd.it/1kcye6b
@r_devops
AWS SAA-C03 Exam Traps That Almost Failed Me (And How to Dodge Them)
Hello comrades!
I cleared my AWS SAA exam recently and made an article about my journey and what common pitfalls to avoid :)
I hope this helps anyone who's planning to take up the examination soon :)
Please feel free to add anything I might have missed :)
https://medium.com/@nageshrajcodes/aws-saa-c03-exam-traps-that-almost-failed-me-and-how-to-dodge-them-08c41ed73e2a?sk=cea7f9606ce910a723b4064b2a48c8d9
I wish you all the very best :')
Thank you :)
https://redd.it/1kd0ghv
@r_devops
Help creating a WhatsApp bot
Hi, I'm trying to create a bot for my company that grabs files from a SharePoint folder and sends them through WhatsApp when asked. I have zero experience; what's the easiest way to do it? My job kind of depends on this.
Edit: I can only use Copilot AI, for privacy policy reasons.
https://redd.it/1kd2t6z
@r_devops
Which DevOps repositories need contributions?
I don't think I am the only one that has a little bit of a spare time in their life and would love to help out on a DevOps project in need.
What are your favorite ones? Which repositories need just a little bit more love, whether writing documentation, improving runtime or adding features?
https://redd.it/1kd41pq
@r_devops
Thoughts on asdf
I ran into this tool a few years back and didn't give it much thought (I ended up using pyenv at the time).
But now I'm juggling a few projects that require different versions of different things. Enter asdf. It's not ultra-intuitive, but in a nutshell:
1. list and get the plugins you need
2. list and install the versions you need
3. set the required versions for your project
You can use it to build images in CI. Talk to databases of different versions. Install pesky tools that require a specific version of Python. The world is your oyster.
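Concretely, step 3 boils down to a `.tool-versions` file at the project root, which asdf reads to resolve every shim (the tools and versions below are just examples):

```
# .tool-versions — one line per tool; checked into the repo,
# so CI and every teammate resolve the same versions
python 3.12.4
terraform 1.9.5
nodejs 20.11.0
```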
If you haven't tried it, I highly recommend it. If you are new/junior, definitely learn it!
Question to the seniors: Do you use asdf? Any alternatives? Cautionary tales? Suggestions?
https://redd.it/1kd4m8y
@r_devops
How do you manage upgrades in a multi-tenant environment where every team does their own thing and "dev downtime" is treated like a production outage?
We support dozens of tenant teams (with more being added every quarter), each running multiple apps with wildly different languages, package versions, and levels of testing. There's very little standardization, and even where we're able to create some, inevitably some team comes along with a requirement and leadership authorizes a one-off alternatively deployed solution with little thought given to the long term maintenance and suitability of said solution. The org's mantra is "don't get in the developers' way," which often ends up meaning: no enforcement, very few guardrails, and no appetite for upgrades or maintenance work that might introduce any friction.
Our platform team is just two people (down from seven a year ago), responsible for everything from cost savings to network improvements to platform upgrades. What happens, over and over again, is this:
1. We test an upgrade thoroughly against our own infrastructure apps and roll it out.
2. Some tenant apps break—often because they're using ancient libraries, make assumptions about networking, or haven’t been tested in years.
3. We get blamed, the upgrade gets rolled back, and now we're on the hook to fix it.
4. We try to schedule time with the tenant teams to reproduce issues in a lower environment, but even their "dev" environments are treated like production. Any interruption is considered "blocking development."
5. Scheduling across dozens of tenants takes weeks or months. The upgrade gets deprioritized as "too expensive" in terms of engineer hours. We get a new top-down initiative and the last one is dropped into tech debt purgatory.
6. A few months later, we try again—but now we have even more tenants and more variables. Rinse and repeat.
It’s exhausting. We’re barely keeping the lights on, constantly writing docs and tickets for upgrades we never actually deliver. Meanwhile, many of these tenant teams have been around for a decade and are just migrating onto our systems. Leadership has promised them we won’t “get in their way,” which leaves us with zero leverage to enforce even basic testing or compatibility standards.
We’re stuck between being responsible for reliability and improvement… and having no authority to actually enforce the practices that would lead to either.
How do you manage upgrades in environments like this? Is there a way out of this loop, or is the answer just "wait for enough systems to break that someone finally cares"?
https://redd.it/1kd6srk
@r_devops
Memcached Docker Images (as small as 124 KB!) – Feedback Wanted
I wanted to share a project I've been working on: a suite of Docker images for Memcached 1.6.38 that I've stripped down to the bare minimum, optimized specifically for containerized environments. These images are scratch-based, TCP-only, and fully configurable using environment variables via patched code (no CLI args needed, but still supported).
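(For anyone wanting to kick the tires, a compose sketch; the image tag and the environment variable name here are guesses, not taken from the project — check the repo README for the real ones:)

```yaml
services:
  memcached:
    image: tigersmile/memcached:latest   # tag illustrative
    ports:
      - "11211:11211"
    environment:
      # Variable name is hypothetical — see the repo README
      MEMCACHED_MEMORY_LIMIT: "64"
```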
Thanks.
🔗 GitHub: https://github.com/johnnyjoy/memcached-docker
🔗 Docker Hub: https://hub.docker.com/r/tigersmile/memcached
https://redd.it/1kd6quk
@r_devops
Interview for associate devops role, not sure how it went, need opinions
I had a technical discussion with a smaller company (around 100-200 employees); they are building out a new devops team. I have 7 YOE at large tech companies as a software engineer, but my duties have aligned more closely with sys admin, infrastructure, Linux admin, developer, kinda devops, or just whatever is needed. I've always wanted to do devops but haven't had the opportunity to pivot. I got an interview at this place, which has had a listing up for over a month for an associate devops engineer at the same salary. The recruiter seemed very excited to meet me and I was excited for this job.
I had the technical interview yesterday, and the first half was about my technical experience with CI/CD tools and cloud environments. I tried to answer what I could but told them I was lacking in this area and have always wanted to learn it, which is why I am so excited for this associate position. I understand the concepts of the tools and have interacted with them, so I could explain them, but I don't have deep hands-on experience. When they asked more in-depth scripting questions I may have been a little shaky, but I eventually came to the correct answer they were looking for.
Then it was the Linux infrastructure guy's turn, and he started shotgunning me system-level questions that I was able to answer immediately and knew were right. The back and forth continued about 5-7 minutes before he said "okay I think I'm good" and handed back to the main interviewer, who asked how I'd troubleshoot an issue. I talked out my thought process, isolated every point of failure, explained the testing for each point, mentioned system-level Linux commands that could be used to troubleshoot, and went deeper into checking firewalls and such. After a bit he asked what I would do if I couldn't find anything there, and I said I'd reach out to teams I know who may interact with this application, ask if any major changes had been pushed out recently that may have caused it, and also ask for any logs on their side to be sent to me for further troubleshooting. Then I would escalate internally. He seemed to like this and started smiling and nodding.
He asked about my strengths, and I noted how in every performance review I have ever received, my managers have said that my attitude, positivity, communication, and mentorship are invaluable, which is why I am always assigned to work with new college hires, interns, and junior devs. It's also why I am usually the point of contact within my team to interface with other teams, as I am usually the easiest to talk to, and why I'm in charge of screening L2 defects for customers and usually the one to assist customers on calls. He also seemed to like this. I made sure to reiterate how much I want to do devops and how excited I am about this opportunity. I asked about next steps, and they said it would be an interview with the head of engineering, which would be the final interview. I was very polite and positive and made them smile and laugh a lot on the call. I followed up the next morning with a sincere thank-you email to everyone on the panel.
I have never done a devops interview and I'm not at all sure how this went. I feel like my natural personality showed through and they really liked it, but I wish the Linux guy had asked me more; I really crushed that section. I really hope I get this job, but I have no idea how this type of hiring works.
https://redd.it/1kd9msw
@r_devops
We open-sourced the internet's largest incident response glossary, with 500+ terms
We just published a public glossary with **500+ terms** related to incident response, on-call, alerting, SLOs, postmortems, and more. I think this is perhaps the internet's largest glossary for incident response.
👉 [https://spike.sh/glossary](https://spike.sh/glossary)
There's no signups, no fluff. Just a clean, searchable list of terms — each one explained in plain English.
----
**Why we built this:**
Writing about incident response, I would always get stuck on terms like *alert correlation* and wonder: should I explain it again? Should I link to something?
There wasn't a single place encompassing all the IR terms. That's when we decided to build our own.
I really thought we could keep it small, and we did in the initial pass. But later on we brought in **700+ terms** (thanks, AI 😅).
There was a lot of back-and-forth, but we did end up narrowing it down to 525 terms that actually matter (*I know, it's still absurdly large...*)
Every term answers:
* What it means
* Why it’s relevant in incident response
* (Sometimes) examples, best practices, or how teams use it
ngl, AI was super helpful in many ways, and we did edit *tons* by hand to make sure it wasn't just noise. Many terms didn't need extras, so we cut those out.
I didn't expect it to be this big; it just happened.
----
Full disclosure: there are still terms we are working to improve, but hey, it's a start and I am happy we got something out there for everyone.
PRs are welcome - [https://github.com/spikehq/glossary](https://github.com/spikehq/glossary)
ps: hosted on cloudflare pages which we love. Special shoutout to [11ty.dev](https://11ty.dev) and Claude code
https://redd.it/1kdazr7
@r_devops
AWS network automation
I find myself in the funny position of redoing part of the network in AWS. We have two parts: one is newer and uses transit gateways centralized in a single account; the other is older and uses VPC peering between many accounts/VPCs. We try to use terraform for everything. That said, how the $%^&* do you automate transit gateways?
In terraform, i have taken the following steps in the past
1) Go into the product's terraform repo, run the attachment module we have, and it outputs the gateway attachment ID.
2) Go into the centralized network account repo, add the CIDR/attachment ID under a region in a large json file, and run it. That adds the attachment ID to a route table (non-prod vs prod), and a static route to the CIDR is added in other regions as needed. The terraform module I wrote is "clever", and Kernighan's law makes it difficult for me to debug problems even with the sub-100 VPCs we have now.
How do people handle this with hundreds of vpcs in a way that keeps state? I can see this working with a bunch of cloudwatch event rules and lambdas, but that seems very push and pray to me whereas I know what I'm getting with terraform before applying it.
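(One common refactor of the "large json file" step is to make the routes a typed variable and let `for_each` fan them out, so `terraform plan` shows exactly which map key creates which route. A sketch — resource and variable names are illustrative, and it assumes the attachment IDs and route table already exist in state:)

```hcl
variable "tgw_routes" {
  # keyed by a stable name, e.g. "product-a-us-east-1"
  type = map(object({
    cidr          = string
    attachment_id = string
  }))
}

resource "aws_ec2_transit_gateway_route" "spoke" {
  for_each = var.tgw_routes

  destination_cidr_block         = each.value.cidr
  transit_gateway_attachment_id  = each.value.attachment_id
  # assumes this route table is defined elsewhere in the repo
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
```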
https://redd.it/1kdcirx
@r_devops
macOS Homebrew and open-source tooling
Hey guys!
Quick question for ya. I've been at this job for a while now, but we just got transitioned over to macOS; we were on Windows machines before. Software was always distributed through self-service software centers or pushed via org policy.
Now, however, I'm running into issues getting up and running with my dev tooling (mostly CLI tools and local cluster dev). Currently Homebrew isn't an approved technology, but it's so common to install tools that way that I'm not familiar with any other common patterns. I've been tasked with making an argument to allow it for devs on my team.
I'm anticipating high skepticism from security folks and others because, as far as I'm aware, they can't "own" the software that gets installed that way. The current pattern would have me contact the helpdesk to install software via .pkg or have it distributed.
Currently other package managers are allowed, like conda, npm, yarn, etc. But I know it's not quite an apples-to-apples comparison.
What arguments would you make to allow Homebrew into the ecosystem? Are any of your jobs able to track what's installed accurately? I'm assuming the local MDR/AV software would pick something up.
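(One concrete angle for the tracking question: a checked-in Brewfile plus `brew bundle` makes installs declarative, so the manifest itself is a reviewable inventory. A sketch, with illustrative package names:)

```ruby
# Brewfile — kept in a team repo; install with `brew bundle`
# and audit via code review rather than per-machine inspection
tap  "hashicorp/tap"
brew "kubectl"
brew "helm"
brew "hashicorp/tap/terraform"
cask "docker"
```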
https://redd.it/1kdcehg
@r_devops
Need Advice on scaling my platform architecture
I’m building a trading platform where users interact with a chatbot to create trading strategies. Here's how it currently works:
- User chats with a bot to generate a strategy
- The bot generates code for the strategy
- FastAPI backend saves the code in PostgreSQL (Supabase)
- Each strategy runs in its own Docker container
- Inside each container:
  - Fetches price data and checks for signals every 10 seconds
  - Updates profit/loss (PNL) data every 10 seconds
  - Executes trades when signals occur
The Problem:
I'm aiming to support 1000+ concurrent users, with each potentially running 2 strategies — that's over 2000 containers, which isn't sustainable. I’m now relying entirely on AWS.
Proposed new design:
Move to a multi-tenant architecture:
One container runs multiple user strategies (thinking 50–100 per container depending on complexity)
Containers scale based on load
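The multi-tenant idea above can be sketched in Python with asyncio (a minimal sketch with hypothetical names, not a production design): one process hosts many strategy loops as tasks, each ticking on its own interval, with per-strategy start/stop handles so individual strategies can be added or removed without touching the container.

```python
import asyncio


class StrategyRunner:
    """Runs many user strategies as asyncio tasks inside one process/container."""

    def __init__(self):
        self._tasks: dict[str, asyncio.Task] = {}

    async def _loop(self, strategy_id: str, tick, interval: float):
        # Each strategy ticks independently; an exception in one tick
        # does not kill the other strategies in the same container.
        while True:
            try:
                tick(strategy_id)
            except Exception as exc:
                print(f"{strategy_id} tick failed: {exc}")
            await asyncio.sleep(interval)

    def start(self, strategy_id: str, tick, interval: float = 10.0):
        if strategy_id not in self._tasks:
            self._tasks[strategy_id] = asyncio.create_task(
                self._loop(strategy_id, tick, interval))

    def stop(self, strategy_id: str):
        task = self._tasks.pop(strategy_id, None)
        if task:
            task.cancel()


async def main():
    ticks = []
    runner = StrategyRunner()
    # Two strategies for one user, short intervals just for the demo.
    runner.start("user1-s1", ticks.append, interval=0.01)
    runner.start("user1-s2", ticks.append, interval=0.01)
    await asyncio.sleep(0.05)
    runner.stop("user1-s1")
    runner.stop("user1-s2")
    return ticks


ticks = asyncio.run(main())
print(f"{len(ticks)} ticks from 2 strategies in one process")
```

With loops that mostly wait on I/O (price fetches, DB writes), hundreds of such tasks per container is plausible, which is what makes the 50-100 strategies-per-container estimate reasonable to benchmark.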
Still figuring out:
How to start/stop individual strategies efficiently — maybe an event-driven system? (PostgreSQL on Supabase is currently used, but not sure if that’s the best choice for signaling)
How to update the database with the latest price + PNL without overloading it. Previously, each container updated PNL in parallel every 10 seconds. Can I keep doing this efficiently at scale?
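One common answer to the PNL-update concern is write coalescing: buffer per-strategy updates in memory and flush them as one batched write per interval, instead of one round-trip per strategy. A minimal sketch, with the actual DB call stubbed out (in practice `flush_fn` would be a single batched UPSERT, e.g. `executemany` against Postgres):

```python
import threading


class PnlBatcher:
    """Buffers per-strategy PNL updates and flushes them as one batched write.

    `flush_fn` stands in for a single batched UPSERT; here it just receives
    the snapshot dict so the coalescing pattern itself is testable.
    """

    def __init__(self, flush_fn, interval: float = 10.0):
        self._flush_fn = flush_fn
        self._interval = interval  # a background timer would call flush()
        self._pending: dict[str, float] = {}
        self._lock = threading.Lock()

    def record(self, strategy_id: str, pnl: float):
        # Later updates for the same strategy overwrite earlier ones,
        # so each flush writes at most one row per strategy.
        with self._lock:
            self._pending[strategy_id] = pnl

    def flush(self):
        with self._lock:
            snapshot, self._pending = self._pending, {}
        if snapshot:
            self._flush_fn(snapshot)  # one round-trip instead of thousands


batches = []
batcher = PnlBatcher(batches.append, interval=10.0)
batcher.record("user1-s1", 12.5)
batcher.record("user1-s1", 13.0)   # overwrites the earlier value
batcher.record("user2-s1", -4.2)
batcher.flush()
print(batches)  # [{'user1-s1': 13.0, 'user2-s1': -4.2}]
```

With this shape, 2000 strategies updating every 10 seconds becomes one batched statement per container per interval, which Postgres handles far more comfortably than 2000 independent writers.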
Questions:
1. Is this architecture reasonable for handling 1000+ users?
2. Can I rely on PostgreSQL LISTEN/NOTIFY at this scale? I read it uses a single connection — is that a bottleneck or a bad idea here?
3. Is batching updates every 10 seconds acceptable? Or should I move to something like Kafka, Redis Streams, or SQS for messaging?
4. How can I determine the right number of strategies per container?
5. What AWS services should I be using here? From what I gathered with ChatGPT, I need to:
Create a Docker image for the strategy runner
Push it to AWS ECR
Use Fargate (via ECS) to run it
https://redd.it/1kdftny
@r_devops
🚨 DevOps Interview in 2 Days with Zero Experience – Need Your Guidance!
Hey r/devops community,
I'm reaching out for some advice. I have an interview for a DevOps internship in just two days. My background includes basic knowledge of Git, Linux, and Python, but I have no prior experience in DevOps.
Given the limited time, what key areas should I focus on to make the most of my preparation? Any resources, tips, or guidance would be greatly appreciated.
Thank you in advance for your support!
https://redd.it/1kdk7va
@r_devops
Redis is open source again?
Redis seems to be Open Source again!!!
With Redis 8, the Redis community is thinking of going back to open source.
Source: https://thenewstack.io/redis-is-open-source-again/
Guys let's discuss this. Is this real?
https://redd.it/1kdlg94
@r_devops
Is this a good DevOps book?
Is this a good DevOps book? I'm planning to buy a book on Azure DevOps.
https://www.amazon.com/Beginning-Azure-DevOps-Releasing-Applications/dp/1394165889
https://redd.it/1kdeimp
@r_devops
As a DevOps Engineer, do I need to know databases?
The question is pretty much in the title: how important is it to know databases to be a better DevOps engineer? Mind you, I'm already a DevOps engineer, but I barely touch anything DB-related, or even networking-related, TBH. Networking aside, how important is it to know databases? I know Postgres and MSSQL a bit; do I need to know a whole lot more?
https://redd.it/1kdrpcq
@r_devops
I made bikya for selling used products and real estate, please check it out!
I made it fully in PHP. Any tips would be helpful.
https://bikya.infy.uk/
https://redd.it/1kdvift
@r_devops
Cobbler/Chef Educational Resources
I’m a network engineer by day and a part-time lab assistant in the evenings to earn a few extra bucks. Within the next 90 days they want me spun up on assisting with tickets, as the physical lift-and-rack and cable audit is wrapping up. They use Cobbler and Chef today and asked that I start learning them; I’ve never touched either. Are there any good resources or recommendations for getting the basics down? I have some familiarity with Ansible, but that’s it.
https://redd.it/1kdv75y
@r_devops
What is k8s in bare metal?
Newbie understanding: if I'm not mistaken, k8s on bare metal means deploying/managing a k8s cluster on a single-node server. In other words, the control plane and node components live on a single server.
However, in managed k8s services like AWS EKS and DigitalOcean DOKS, I see that the control plane and node components can be on different servers (multi-node).
So that would mean EKS and DOKS are more suitable for complex setups, and bare metal for a more manageable one.
I'd appreciate any knowledge/answers shared. TIA.
https://redd.it/1kdy5af
@r_devops
Jira time logging for DevOps
I work at a big company and we are required to log the time we work on Jira tickets, to measure our productivity and for other management reports. Sometimes I work the full 8 hours, but most of the time I finish my tasks and sit free for most of the day. So sometimes I fake the logged hours so they think I'm fully utilized. I've raised this with my manager and he said to fill my backlog and improve the system. I get that I can find some things to improve, but that won't always be the case, and I'll still have some idle time in the end.
So my questions to you is:
Do you face similar situations at your company? What does it look like?
How do you measure the productivity of the team?
Is the logged time a good measure to check the engineers productivity?
Any other thoughts? :) Thanks
https://redd.it/1kdxiak
@r_devops