Reddit DevOps
267 subscribers
30.9K links
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Discussion: what are must-read books for DevOps engineer?

Hi guys,

I am looking into switching into the DevOps field from full-time web dev, and I'm curious: what are the most important and up-to-date books someone like me should read? Even books that aren't directly about DevOps but would be helpful in the future.

Share your thoughts! Thanks!

https://redd.it/1igte2s
@r_devops
How do you handle log noise and event overload in high-volume environments?

Hey everyone, I’m curious about how you manage log overload in fast-growing infrastructures. Between low-priority warnings, duplicate events, and false positives, it can be tough to separate the noise from what actually matters.

Do you use filtering, deduplication, or automation to keep things manageable? What strategies or tools have helped you cut down log bloat while still catching critical alerts?
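One common strategy is windowed deduplication: fingerprint each event and suppress repeats for a fixed interval. A minimal Python sketch of the idea (class name and window size are illustrative, not from any particular tool):

```python
import time
from collections import OrderedDict

class LogDeduplicator:
    """Suppress repeats of the same log fingerprint within a time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.seen = OrderedDict()  # fingerprint -> timestamp of first emit

    def should_emit(self, message, now=None):
        now = time.monotonic() if now is None else now
        # Evict fingerprints whose window has expired (oldest first).
        while self.seen:
            key, ts = next(iter(self.seen.items()))
            if now - ts >= self.window:
                self.seen.pop(key)
            else:
                break
        fingerprint = hash(message)
        if fingerprint in self.seen:
            return False  # duplicate within the window: drop it
        self.seen[fingerprint] = now
        return True
```

The window here is anchored to the first emission, so a steadily repeating event surfaces once per window; real pipelines usually also attach a "suppressed N duplicates" counter when the window closes.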

https://redd.it/1igt4p3
@r_devops
DevOps to Data Platforms

I'm looking for some advice on how to quickly get up to speed with a new job.

Previously I was working in a dotnet shop at a smaller company. I was managing Azure, Pipelines, WAFs, networking, basically anything infrastructure-related that wasn't inside the app itself: a typical "devs are bad at networking" kind of gig.

Now I'm at a bigger company with a dispersed team, where our only job is to manage a data platform for data engineers. The problem is, I don't know the first thing about data. I've tried searching around, but most of the information I find is geared towards managing the data itself, not managing the platform. I remember struggling with this at the dotnet shop, but there I had much better support: the devs would interact with me and teach me what they were doing, so in turn I could help them bridge their gaps with infrastructure. That doesn't feel like an option in this new role, so I'm trying my best to cover my ass.


Any advice? I can google things as they come up, but I'd like to get somewhat ahead of the curve so I don't have to push off every question I'm asked.

https://redd.it/1igz14r
@r_devops
How to get started in DevOps? Certs?

I have almost 3 years of experience in QA with both manual and mobile automation. QA and front-end development seem very saturated; my friend/mentor says DevOps is the next logical step from QA roles, and it also seems less saturated. How do I get started? What certs should I get in automation or DevOps? Thoughts?

https://redd.it/1ih0i1c
@r_devops
How much DSA should I know for a DevOps or SRE role?

For real, I don't know how much LeetCode and DSA I need to master, beyond the tools of the DevOps trade, to attend a technical interview for DevOps. Can someone help me?

https://redd.it/1ih2hhi
@r_devops
Devops/Infra/SRE/Platform Engineer Jobs

So I want to switch to a new job, and I was wondering: other than LinkedIn, what have people used to look for a job?

https://redd.it/1ih4jpa
@r_devops
database consolidation

We have a lot of database servers. Generally one per app, and then the dev and stage instances have their own servers. Note, I'm talking servers, not databases.

We think this is too many but not sure what to do about it. I'm curious about people's philosophies here.

Large consolidated instances seem to be difficult to maintain and mean a lot of applications go down if one goes down. So I don't think we want to centralize to that degree.

One thing we've thought about is combining test/dev on the same servers. Not sure they really need their own.

We want to keep prod separate though.

But maybe someone smarter than me has thought about this. Curious what people are doing.

https://redd.it/1ih89ia
@r_devops
Help regarding the conversion from Aurora Serverless v1 to the provisioned instance.

I'm currently in the middle of upgrading my Aurora RDS Serverless v1 cluster to Serverless v2, but the official documentation includes a step that involves converting Serverless v1 to a provisioned instance first. I cannot find any such option directly in the console. How do I go about it?
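If I recall correctly, this conversion is exposed through the CLI rather than the console; a hedged sketch (cluster identifier and instance class below are placeholders, so check them against the current AWS CLI reference before running):

```shell
# Convert a Serverless v1 cluster to provisioned engine mode.
# --allow-engine-mode-change is required to permit the switch.
aws rds modify-db-cluster \
    --db-cluster-identifier my-serverless-v1-cluster \
    --engine-mode provisioned \
    --allow-engine-mode-change \
    --db-cluster-instance-class db.r5.large \
    --apply-immediately
```

After the cluster is provisioned, Serverless v2 capacity can be added by configuring the cluster's serverless v2 scaling and attaching `db.serverless` instances.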



https://redd.it/1ihasy0
@r_devops
What should I do?

Hey people, I'm a newbie to DevOps, just starting out with roadmap.sh and KodeKloud courses. I've come across various posts on different platforms saying that learning in public gets real attention and helps grow your network. I do share my learnings on LinkedIn and Twitter (and have for a long time now), but I can't seem to get any recognition. What else should I do? I figure making short videos for Instagram and YouTube Shorts might be a good way to deliver content, but I don't know how to do all the stuff (editing, recording, etc.). Can y'all help me out?

https://redd.it/1ihd1in
@r_devops
Need Help Integrating AWS ECS Cluster, Service & Task with LGTM Stack using Terraform

So I've been working on integrating the LGTM stack into my current AWS infrastructure.

Let me first explain the work I've done so far.

###### LGTM Infra:

- Grafana: AWS Managed Grafana with Loki, Mimir, and Tempo data sources, deployed using Terraform.

- Loki, Tempo, and Mimir servers hosted on EC2 using Docker Compose, with AWS S3 as backend storage for all three.

- To push my ECS task logs, metrics, and traces, I've added sidecars to the current app's task definition; they run alongside the app container and push the data to the Loki, Tempo, and Mimir servers. For logs I'm using the **awsfirelens** log driver; for metrics and traces I'm using Grafana Alloy.

The LGTM server stack is running fine and all three kinds of data are being pushed to the backend servers. Now I'm facing an issue with labeling: the metrics and traces are pushed to the Mimir and Tempo backends, but how will I identify which cluster, service, and task these logs, metrics, and traces are coming from?

For logs it was straightforward since I was using the AWS FireLens log driver; the config looked like this:

```hcl
log_configuration = {
  logDriver = "awsfirelens"
  options = {
    "Name"       = "grafana-loki"
    "Url"        = "${var.loki_endpoint}/loki/api/v1/push"
    "Labels"     = "{job=\"firelens\"}"
    "RemoveKeys" = "ecs_task_definition,source,ecs_task_arn"
    "LabelKeys"  = "container_id,container_name,ecs_cluster"
    "LineFormat" = "key_value"
  }
}
```

As you can see in the screenshots below, the ECS-related details are getting populated in Grafana:
: https://i.postimg.cc/HspwKRVW/loki.png

and I was able to create a dashboard for the same, with some basic filtering and a search box:
: https://i.postimg.cc/tT36vNbV/loki-dashboard.png

Now comes the metrics, a.k.a. the Mimir part:

For this I used Grafana Alloy with the following `config.alloy` file:

```alloy
prometheus.exporter.unix "local_system" { }

prometheus.scrape "scrape_metrics" {
  targets         = prometheus.exporter.unix.local_system.targets
  forward_to      = [prometheus.relabel.add_ecs_labels.receiver]
  scrape_interval = "10s"
}

remote.http "ecs_metadata" {
  url = "ECS_METADATA_URI"
}

prometheus.relabel "add_ecs_labels" {
  rule {
    source_labels = ["__address__"]
    target_label  = "ecs_cluster_name"
    regex         = "(.*)"
    replacement   = "ECS_CLUSTER_NAME"
  }

  rule {
    source_labels = ["__address__"]
    target_label  = "ecs_service_name"
    regex         = "(.*)"
    replacement   = "ECS_SERVICE_NAME"
  }

  rule {
    source_labels = ["__address__"]
    target_label  = "ecs_container_name"
    regex         = "(.*)"
    replacement   = "ECS_CONTAINER_NAME"
  }

  forward_to = [prometheus.remote_write.metrics_service.receiver]
}

prometheus.remote_write "metrics_service" {
  endpoint {
    url = "${local.mimir_endpoint}/api/v1/push"
    headers = {
      "X-Scope-OrgID" = "staging",
    }
  }
}
```


I used AWS Parameter Store to hold this config and added another sidecar to the app's task definition. The sidecar loads the config file and runs a custom script that fetches the ECS cluster name from ECS_CONTAINER_METADATA_URI_V4; the service name and container name are passed in as ECS task definition environment variables.

After all this, I was able to do the relabeling and populate the cluster, service, and task names in the Mimir data source:

: https://i.postimg.cc/Gh8LchBX/mimir.png


Now, when I tried to use the Node_Exporter_Full Grafana dashboard for the metrics, I was getting the metrics, but with Unix-level filtering only:

: https://i.postimg.cc/Jn0wPPZp/mimir-dashboard-1.png

:
https://i.postimg.cc/mD5vqCSB/mimir-dashboard-filter.png

So I did some dashboard JSON editing and was able to add ECS Cluster Name, ECS Service Name & ECS Container Name filters to the same dashboard:

: https://i.postimg.cc/2yLsfyHv/mimir-dashboard-2.png

but now I'm not able to get the metrics on the dashboard.

It's been only two weeks since I started with observability (before that I didn't know much beyond the term itself), so I might be doing something wrong with the metrics in my custom Node Exporter dashboard.

Do I need to relabel the existing labels like `job` and `host`, replacing them with my added labels like the ECS service or container names, to fetch the metrics on a per-ECS-container basis?
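If it helps, one direction to experiment with (a sketch only, not verified against this dashboard): most Node Exporter dashboards filter on the standard `job` and `instance` template variables, so rewriting `instance` to the per-container value may let the existing panels resolve. ECS_CONTAINER_NAME is the same placeholder style as the config above:

```alloy
// Sketch: an extra rule inside the existing prometheus.relabel
// "add_ecs_labels" block, rewriting the standard `instance` label
// so dashboards keyed on `instance` show one entry per ECS container.
rule {
  source_labels = ["__address__"]
  target_label  = "instance"
  regex         = "(.*)"
  replacement   = "ECS_CONTAINER_NAME"
}
```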

Since I'm doing this for the first time, I'm not sure about much of this.

If anyone here has done something similar, can you please help me with this implementation?

Once this is done, the next step will be aggregated metrics per ECS service (since there may be more than one task running for one ECS service), and then I believe I'll need similar relabeling for the Tempo traces as well.

Please help me out, guys.

Thank you!!!

https://redd.it/1iheu0p
@r_devops
Linux Server which can run Virtualbox for a month, where to go ? EU

A customer's client provided me a dev environment based on Vagrant. I'm not looking for alternatives to that; it is what it is. That Vagrant box runs k3s. I tried my old Intel MacBook Pro but I'm lacking memory. I need a server that can run VirtualBox, on a short contract, two months max. Where should I go?

Hope this post is OK with the mods; I'm asking for vendor recommendations.

https://redd.it/1ihfug0
@r_devops
Cannot reach service by node ip and port from browser

I'm running Docker Desktop on a Windows 11 PC. I wanted to try the built-in Kubernetes, which is based on Kind. It works, although I cannot reach a service by node IP and port from the browser. I tested the connection inside the cluster and it works fine. I also tried disabling firewalls. When I tried Minikube with the Hyper-V driver it worked fine; using the Docker driver gave me the same problems as Kind. How do I solve this?

https://redd.it/1ihhe59
@r_devops
I built an AI agent for website monitoring - looking for feedback

Hey everyone, I wanted to share [https://flowtest.ai/](https://flowtest.ai/), a product my 2 friends and I are working on. We’d love to hear your feedback and opinions.

Everything started when we discovered that LLMs can be really good at browsing websites simply by following a ChatGPT-like prompt. So we built an LLM agent and gave it tools like keyboard & mouse control. We parse the website, and the agent performs the actions you prompt it to do. This opens up lots of opportunities for website monitoring and testing. It's also a great alternative to Pingdom.

Instead of just pinging a website, you can now prompt an AI agent to visit and interact with it as a real user would. Even if the website is up, the agent can identify other issues and immediately alert you if certain elements aren't functioning correctly, e.g. a 3rd-party app crashes or a feature fails to load.

Once you set a frequency for the agent to run its monitoring flow, it will actually visit your website each time. LLMs are now smart enough that, combined with our web parsing, the agent will adapt if some web elements change, without asking for your help.

**Here are a few more complex examples of how our first customers are using it:**

* Agent visits your site, enters a keyword in a search box, and verifies that relevant search results appear.
* Agent visits your login page, enters credentials, and confirms successful login into the correct account.
* Agent completes a purchasing flow by filling in all necessary fields and checks if the checkout process works correctly.

We initially launched it as a quality assurance testing automation agent but noticed that our early customers use it more as a website uptime monitoring service.

We offer a 7-day free trial (no cc required), but if you'd like to try it for a longer period, just DM me and I'll give you a month free of charge in exchange for your feedback.

We’d love to hear all your feedback and opinions.



https://redd.it/1ihhv45
@r_devops
Hyperping vs. Better Stack vs. OneUptime for observability

Which one is better? Pricing is not the problem.

I am specifically interested in synthetic monitoring with Playwright.

https://redd.it/1ihkrew
@r_devops
Looking to get back into a DevOps role.

Looking for any tips on what I need to focus on when interviewing. I've worked in IT for 20+ years; I've been a team lead on Linux and virtualization teams, have worked with most automation tools, and have sold some of these products. It's been a while since I've sat in this role, and I'm looking for pointers on anything new in the market, what I should focus on now, and what to expect from the interviews.

https://redd.it/1ihm7vc
@r_devops
Best way to sync a private GitHub repo to a shared remote machine without shared credentials?

My team and I have a remote desktop machine connected to a PLC, conveyor belt, and sensors. We need to clone and pull updates from our private GitHub repository to this machine. However, we’re stuck on how to do this efficiently without creating a shared user account on the machine (which would require sharing credentials).



Here’s the issue:

- We can't create a GitHub account for the machine because it doesn't have an official organization email.
- Sharing a single user account on the machine isn't ideal and goes against best practices.
- We need to be able to:
  - Clone and pull the latest changes to the machine.
  - Push changes made on the remote machine back to the repo using our individual GitHub credentials.



**Options we’re considering:**

1. Use tools like TeamViewer or SSH tunnels to transfer files between our local machines (which are already set up) and the remote machine.

2. Set up GitHub access on the remote machine but deal with the inefficiency of constantly being prompted for user credentials when pushing changes.



What’s the best practice here? Are there tools or workflows (deploy keys, GitHub Actions?) designed for this kind of scenario? Any advice or recommendations would be greatly appreciated!
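For the clone/pull side, deploy keys are a standard fit: a repo-scoped SSH key that belongs to the machine, not to any person. A hedged shell sketch (paths and repo name are placeholders); note that a deploy key identifies the machine, so pushes attributed to individuals would still need per-person credentials (e.g. short-lived tokens or forwarded SSH agents):

```shell
# Create a key pair owned by the machine, not any person.
# A temp dir stands in for the machine's ~/.ssh here.
KEYDIR="$(mktemp -d)"
ssh-keygen -t ed25519 -f "$KEYDIR/deploy_key" -N "" -C "plc-machine" -q

# Paste this public key into the repo on GitHub:
# Settings -> Deploy keys (read-only by default, write optional).
cat "$KEYDIR/deploy_key.pub"

# Clone/pull using only that key (repo path is a placeholder):
# GIT_SSH_COMMAND="ssh -i $KEYDIR/deploy_key -o IdentitiesOnly=yes" \
#   git clone git@github.com:your-org/your-repo.git
```

Pinning the key with `IdentitiesOnly=yes` (or an entry in `~/.ssh/config`) keeps SSH from offering other identities that happen to be on the machine.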

https://redd.it/1ihjvoj
@r_devops
Learning GCP and Terraform at the same time?

I'm confident at frontend development. I know the basics of Node and Postgres. But I'm weak on DevOps.

I've traditionally been a freelancer and used tools like Vercel and Supabase. However now I have a job with a startup and I need to learn GCP.

I've only spent half a day on it so far, but I find using Google Cloud's web console and `gcloud` in the terminal quite awkward.

Does it make sense to use something like Terraform from the start? I like the idea of a code-first approach, and being able to switch providers in the future is also nice (we're on GCP as we got a bunch of free credits).
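For what it's worth, the code-first approach is small enough to try on day one; a minimal Terraform sketch (project ID, region, and bucket name are placeholders):

```terraform
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project = "my-gcp-project-id" # placeholder
  region  = "europe-west1"      # placeholder
}

# One simple resource to confirm the init/plan/apply workflow end to end.
resource "google_storage_bucket" "example" {
  name     = "my-unique-example-bucket" # must be globally unique
  location = "EU"
}
```

`terraform init`, `terraform plan`, and `terraform apply` then replace the ad-hoc console/`gcloud` steps, and the state file records what actually exists. One caveat on switching providers later: Terraform is portable as a workflow, but resource definitions are provider-specific, so a GCP config doesn't transfer to AWS unchanged.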

https://redd.it/1ihpl7g
@r_devops
How Much Do You Spend on Databases? (2-Min Survey)

Hey all,

We're doing a quick research study on **database costs & infrastructure**, figuring out how developers & companies use PostgreSQL, InfluxDB, ClickHouse, and managed DBaaS.

**Common problems we hear:**

* 💸 AWS RDS costs way more than expected
* 😩 Managing **high availability & scaling** is painful
* 🔗 Vendor lock-in sucks

🔥 If you run databases, we’d **love your insights!**

👉 **Survey Link (2 mins, no email required):** [https://app.formbricks.com/s/cm6r296dm0007l203s8953ph4](https://app.formbricks.com/s/cm6r296dm0007l203s8953ph4)

(Results will be shared back with the community!)

https://redd.it/1ihvket
@r_devops