Reddit DevOps
266 subscribers
30.9K links
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
How can I truly grow as a full-stack developer in the AI era?


I’m a solo full-stack developer at my company, managing infrastructure and development with my team lead. While I can deploy applications using Kubernetes, Docker, and other modern tools, I rely heavily on AI (ChatGPT, DeepSeek) to complete tasks. This has made me efficient, but I lack deep technical understanding and struggle to answer in-depth questions, making interviews challenging.

With AI rapidly evolving, I want to future-proof my career. My main concerns:
1. How can I build a deeper understanding of technologies instead of just relying on AI?
2. What skills should I focus on to stay competitive and confident in interviews?
3. Should I transition towards AI-related development, or strengthen core engineering skills?

Looking for advice from experienced developers—how do I break out of this cycle and grow meaningfully?

https://redd.it/1iyc1kw
@r_devops
How to Prevent Ephemeral Storage from Filling Up in AWS Fargate with FireLens & Datadog?

I'm running a PHP app on AWS ECS Fargate and using FireLens (Fluent Bit) to send logs to Datadog. However, I'm facing an issue where ephemeral storage fills up quickly due to backpressure.

I want to:

* **Limit RAM usage** for log buffering (e.g., 256MB).
* **Use ephemeral storage only when needed** (max 5GB).
* **Increase worker threads** (16) to flush logs faster.

I'm using `storage.type=filesystem`, but Fargate **doesn’t allow sourcePath** for volumes, so I can't explicitly define a storage path. My task definition keeps failing.

How can I configure FireLens in Fargate to handle backpressure efficiently without filling up storage? Any best practices?

https://redd.it/1iyhwgb
@r_devops
Feeling Stuck in My DevOps Role – Need Career Advice

Hey DevOps folks,

I'm a DevOps engineer with 2 years of experience working at a startup. I primarily work with AWS cloud and some Azure (mostly pipelines), managing 7 applications across 3 environments each. Recently, we migrated to ECS with a cross-account setup, which was an exciting challenge. However, now that most things are automated with Terraform, there’s not much left to do—rarely any production issues, and my work feels stagnant.

Since I’m still early in my career, I don’t want to get stuck doing just this. I’m planning to switch to a new company and need some advice:

1. What type of company should I target? (Startups vs. bigger companies, service-based vs. product-based)

2. What technologies should I focus on learning? (I have hands-on experience with AWS, Azure DevOps, Jenkins, Prometheus, and Grafana. I know Kubernetes but haven’t used it in a real project.)

3. Any other suggestions? (e.g., full remote jobs, certifications, or alternative career paths)

Would really appreciate your insights!!

https://redd.it/1iyipp2
@r_devops
Jenkins CICD pipeline migration to GitLab

Hey guys,
What's your experience with migrating CI/CD pipelines from Jenkins to GitLab? Is rewriting the CI/CD files one by one really the only way, or is there a tool for that? What do you think is the best practice?

https://redd.it/1iyjoyf
@r_devops
Debug & chill #2 - Articles of infra & devops debugging

Thrilled to Share the Second Episode of My Debug & Chill Series!

Back in 2020, I started documenting some of my most intriguing troubleshooting adventures, and now I’m releasing them as a blog series. Each post dives into real problems I faced, how I used different tools, and my step-by-step logic.

This second installment dives into a puzzling case of packet duplication in a VMware environment—a seemingly simple scenario that turned out to be much trickier than it looked. Curious about the cause and how we tracked it down?

Check out Debug & Chill #2 here:

https://royreznik.substack.com/p/debug-and-chill-2-strange-packet

I’d love to hear your thoughts or any similar experiences you’ve had. Let me know in the comments!

https://redd.it/1iyjs7q
@r_devops
Using engineering metrics for good!

Can you share some examples of implementing engineering metrics in your daily workflow that positively impact your team performance?

https://redd.it/1iyin9z
@r_devops
Analyzing OpenTelemetry Data in Real Time with SQL - All Open Source

Hi folks!

I recently wrote a blog post on how to analyze OTel data in real time with SQL, using Feldera and Grafana, both open source tools.

We collect data from the OTel Collector, send it to your self-hosted Feldera instance for analysis, and visualize it with Grafana.

The blog post: https://www.feldera.com/blog/opentelemetry

We also have a more detailed use case article: https://docs.feldera.com/use_cases/otel/intro

Feel free to ask any questions, and hopefully this is useful to you!

https://redd.it/1iymaze
@r_devops
Just Started a DevOps Blog – Looking for Feedback & Suggestions! 🚀

Hey r/devops community!

I recently launched a personal blog where I share my experiences, challenges, and insights as a DevOps engineer. My goal is to post weekly about new technologies, interesting problems I encounter, and solutions I find useful in real-world scenarios.

My latest post is about EKS Auto Mode – I cover provisioning from scratch, deploying both stateless and stateful applications, and all the details involved in setting up a cluster in Auto Mode. I believe it could be a game-changer in the field, and I’d love to hear your thoughts on it!

👉 https://haykops.com/posts/eks-auto-mode/

I'm open to any feedback—whether it's about the content, topics you'd like me to cover, or how I can make the blog more valuable for the DevOps community.

Would love to hear your thoughts! Thanks in advance. 🙌

https://redd.it/1iyligy
@r_devops
I built an open-source dashboard for VM images

Hi,

I built this project because I wanted an easier way to visualise all Virtual Machine Images. I was also just very sick of people not following naming conventions and keeping track of images in spreadsheets.

Img-Dash is a simple dashboard for VM images across AWS, GCP and Azure that you can run locally.

Features:

* Consolidated view of all VM images and their data
* View, attach, or delete contextual information (IaC code, event data, compliance scripts)
* Shows which VMs are using which image
* Simple search and list of images in the dashboard

As a DevOps engineer, it has been ages since I've developed a full stack application so feedback is much appreciated!

Repo: https://github.com/shaozae/Img-Dash

https://redd.it/1iyq02j
@r_devops
HELP: Trying to optimize my GitHub Action to not install things every time. I'm new to this CI/CD thing

Hi friends, I'm looking for advice on speeding up my GitHub Actions workflow. Currently, a significant portion of the run time goes to these steps:

sudo apt-get install -y gettext
yarn install --frozen-lockfile --silent
yarn my custom script which runs the react-gettext-parser npm library

These steps are executed on every push/PR, and I'm wondering if there's a more efficient way to handle them?
I wonder if it would be better if I could, for instance, compile what I'm installing, and instead use that compiled thing when my action triggers without having to install everything every time.

Has anyone faced similar challenges and found effective solutions? I'm open to any suggestions or best practices you can share. Thanks in advance : )
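
One common approach to the question above is to cache the installed dependencies and reuse them until the lockfile changes. In GitHub Actions this is typically done with `actions/cache` and a key built from `hashFiles('**/yarn.lock')`. The sketch below only illustrates the idea behind such a key; the function name and the `yarn` prefix are hypothetical, and it assumes the lockfile is the only input that should invalidate the cache.

```python
import hashlib
from pathlib import Path


def dependency_cache_key(lockfile: Path, prefix: str = "yarn") -> str:
    """Return a cache key that changes only when the lockfile content changes.

    Hypothetical helper: it mirrors the idea of keying a dependency cache
    on a hash of the lockfile, so an unchanged lockfile maps to the same key.
    """
    digest = hashlib.sha256(lockfile.read_bytes()).hexdigest()
    return f"{prefix}-{digest}"
```

With a key like this, the install step can be skipped whenever a cache entry for the current key already exists. The `apt-get` step is not covered by this sketch and would need its own handling.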

https://redd.it/1iyr471
@r_devops
How can I improve at performance tuning topologies/systems/deployments?

Machine learning engineer here, ~4.5 YOE. Most of my XP has been training and evaluating models. But I just started a new job where my primary responsibility will be to optimize systems/pipelines for low-latency, high-throughput inference. TL;DR: I struggle at this and want to know how to get better.

Model building and model serving are completely different beasts, requiring different considerations, skill sets, and tech stacks. Unfortunately I don't know much about model serving - my sphere of knowledge skews more heavily towards data science than computer science, so I'm only passingly familiar with hardcore engineering ideas like networking, multiprocessing, different types of memory, etc. As a result, I find this work very challenging and stressful.

For example, a typical task might entail answering questions like the following:

- Given some large model, should we deploy it with a CPU or a GPU?

- If GPU, which specific instance type and why?

- From a cost-saving perspective, should the model be available on-demand or serverlessly?

- If using Kubernetes, how many replicas will it probably require, and what would be an appropriate trigger for autoscaling?

- Should we set it up for batch inferencing, or just streaming?

- How much concurrency will the deployment require, and how does this impact the memory and processor utilization we'd expect to see?

- Would it be more cost effective to have a dedicated virtual machine, or should we do something like GPU fractionalization where different models are bin-packed onto the same hardware?

- Should we set up a cache before a request hits the model? (okay this one is pretty easy, but still a good example of a purely inference-time consideration)

The list goes on and on, and surely includes things I haven't even encountered yet.

I am one of those self-taught engineers, and while I have overall had considerable success as an MLE, I am definitely feeling my own limitations when it comes to performance tuning. To date I have learned most of what I know on the job, but this stuff feels particularly hard to learn efficiently because everything is interrelated with everything else: tweaking one parameter might mean a different parameter set earlier now needs to change. It's like I need to learn this stuff in an all-or-nothing fashion, which has proven quite challenging.

Does anybody have any advice here? Ideally there'd be a tutorial series (preferred), blog, book, etc. that teaches how to tune deployments, ideally with some real-world case studies. I've searched high and low myself for such a resource, but have surprisingly found nothing. Every "how to" for ML these days just teaches how to train models, not even touching the inference side. So any help appreciated!
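
As a starting point for the replica-count and concurrency questions in the list above, a rough estimate can be taken from Little's Law (requests in flight = arrival rate x average latency). The sketch below is a back-of-the-envelope calculation, not a tuning method; the function name, the parameters, and the 0.7 utilization headroom are all assumptions chosen for illustration.

```python
import math


def estimate_replicas(
    requests_per_second: float,
    avg_latency_seconds: float,
    concurrency_per_replica: int,
    target_utilization: float = 0.7,
) -> int:
    """Estimate how many replicas are needed to hold the in-flight requests.

    Little's Law: in-flight requests = arrival rate x average latency.
    Each replica is assumed to serve `concurrency_per_replica` requests at
    once, and only `target_utilization` of that capacity is planned for.
    """
    if concurrency_per_replica <= 0:
        raise ValueError("concurrency_per_replica must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    in_flight = requests_per_second * avg_latency_seconds
    usable_slots = concurrency_per_replica * target_utilization
    return max(1, math.ceil(in_flight / usable_slots))


# Example: 200 req/s at 250 ms average latency, 8 concurrent requests per replica
print(estimate_replicas(200, 0.25, 8))
```

The result is only as good as the measured latency and concurrency figures, so it is best treated as a first guess to validate with load tests.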

https://redd.it/1iysmlj
@r_devops
Can Kaniko build a container with provenance=mode-min?

When going through the Kaniko docs I don't see an area for the Kaniko "--provenance" flag. Is setting this provenance level not a feature of Kaniko? Is there an alternate way of setting provenance with Notary/Oras? Is the provenance level set to min by default?

https://redd.it/1iyrvv9
@r_devops
can you guys roast my resume?

Hello everyone, I'm a master's student who has just started applying for jobs. I don't have much experience in the IT field, so I built my resume solely around projects. I'm looking for jobs in DevOps (I know companies don't hire freshers for DevOps roles), SRE, cloud engineering, and related roles. I'm still learning DevOps, which is why I don't have any DevOps projects yet, but I will add them soon.
Could any of you roast/review my resume? It would be really appreciated.

Resume link: https://www.reddit.com/r/aws/comments/1iyws7u/can_you_guys_roast_my_resume/

Thanks in advance!

https://redd.it/1iywybb
@r_devops
Should I get a degree in Cloud Computing or Software Engineering from WGU?

I have an associate degree in computer science and internship experience in DevOps. I've been applying for jobs with no luck. I'm thinking about getting a bachelor's degree in cloud computing from WGU, or should I go for software engineering, data analytics, or cybersecurity instead?

https://redd.it/1iyypoh
@r_devops
What to do

I am looking to pursue a major. Should I choose computer engineering, software engineering, or electrical engineering if I want to become a DevOps engineer?

https://redd.it/1iyz313
@r_devops
How do you manage database access?

We have a few AWS Aurora PostgreSQL databases where we manage database roles for our applications. This is done via psql.

The obvious problem is that it's very manual and not visible without running multiple psql commands. It's tedious to see which roles are available and which schemas, tables, columns they have access to.

What do you all use to visualize and manage this? Even better if it's a universal tool for other kinds of databases (MySQL, Trino, etc.)

Thanks for any advice!
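
For the visibility part of the question above, one option is to read the grants from PostgreSQL's `information_schema.role_table_grants` view and group them per role. The sketch below assumes the rows have already been fetched by whatever client is in use; `summarize_grants` and the sample rows are hypothetical, and the view only lists grants visible to the connected role.

```python
from collections import defaultdict

# Query against the standard information_schema view in PostgreSQL.
GRANTS_QUERY = """
SELECT grantee, table_schema, table_name, privilege_type
FROM information_schema.role_table_grants
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY grantee, table_schema, table_name;
"""


def summarize_grants(rows):
    """Group (grantee, schema, table, privilege) rows into role -> table -> privileges."""
    summary = defaultdict(lambda: defaultdict(set))
    for grantee, schema, table, privilege in rows:
        summary[grantee][f"{schema}.{table}"].add(privilege)
    return {
        role: {table: sorted(privs) for table, privs in sorted(tables.items())}
        for role, tables in sorted(summary.items())
    }


# Hypothetical rows, shaped like the result of GRANTS_QUERY
sample_rows = [
    ("app_rw", "public", "orders", "SELECT"),
    ("app_rw", "public", "orders", "INSERT"),
    ("app_ro", "public", "orders", "SELECT"),
]
for role, tables in summarize_grants(sample_rows).items():
    print(role, tables)
```

The same grouping can feed a report or a dashboard. Column-level grants would need a similar query against `information_schema.column_privileges`.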

https://redd.it/1iyqa64
@r_devops
IIS vs NGINX vs Apache

I had to install and configure a server to deploy web applications and APIs built in Node.js. To clarify, these are intranet applications that will be used only inside the local company network. This is my first server and I was a little scared, so I started with Windows Server. I built an Express server to serve each web app and managed to deploy every single web service.

I wanted a built-in web server to handle concerns such as caching and security, and to act as a gateway that protects these APIs and serves these applications, so I went with IIS. However, I am having trouble deploying web apps developed with React. All I hear about IIS is that it is crap and only fits Microsoft technologies.

I have the freedom to change anything I want, so I want to ask you: should I switch the host to a Linux distro and use NGINX or Apache, even though I have no experience with built-in web servers or with Linux in general? Or should I stick with IIS for now until I learn Linux and web servers properly?

https://redd.it/1iz1kt3
@r_devops
Vagrant - WSL - Ansible

Does anyone know how to make this setup work properly? I figured out how to make WSL, Windows, and Vagrant work together on VirtualBox, but it's the Ansible piece that's killing my project.

My goal is pretty simple: I am learning Ansible, so I want to spin up 3 Ubuntu VMs in Vagrant and then have Ansible run through each node and create a new user on each machine. My problem seems to happen at the SSH step, as it gets stuck after creating the first VM.
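
One way to connect the two tools in a setup like the one described above is to run Ansible separately from Vagrant and build its inventory from the output of `vagrant ssh-config`. The sketch below is a minimal illustration of that idea and does not diagnose the hang; the helper names, the `[vagrant]` group, and the sample host values are hypothetical, and it assumes key paths without spaces or quotes.

```python
def parse_ssh_config(text: str) -> list[dict[str, str]]:
    """Parse OpenSSH-style config text (as printed by `vagrant ssh-config`)."""
    hosts: list[dict[str, str]] = []
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts
        if key == "Host":
            hosts.append({"Host": value})  # start a new host entry
        elif hosts:
            hosts[-1][key] = value
    return hosts


def to_inventory(hosts: list[dict[str, str]]) -> str:
    """Render an INI-style Ansible inventory using standard connection variables."""
    lines = ["[vagrant]"]
    for h in hosts:
        lines.append(
            f"{h['Host']} ansible_host={h['HostName']} ansible_port={h['Port']} "
            f"ansible_user={h['User']} ansible_ssh_private_key_file={h['IdentityFile']}"
        )
    return "\n".join(lines)


# Hypothetical output for one VM
sample = """Host node1
  HostName 127.0.0.1
  User vagrant
  Port 2222
  IdentityFile /home/me/project/.vagrant/machines/node1/virtualbox/private_key
"""
print(to_inventory(parse_ssh_config(sample)))
```

Testing each host with a plain `ssh` command built from the same values can help tell an SSH problem apart from an Ansible one.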

https://redd.it/1iz1kv3
@r_devops
Is there a debugger or some tool to check which container calls which container?

I have about 30 containers calling one another using messages and HTTP calls, and sometimes it's impossible to know what is calling what because the services are tightly coupled and keep calling one another.
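
If each call can be recorded as a (caller, callee) pair, for example from distributed tracing spans or access logs, the pairs can be folded into a dependency map. The sketch below assumes such records already exist, which the post does not state; the function names and service names are hypothetical.

```python
from collections import defaultdict


def build_call_graph(calls):
    """Fold (caller, callee) pairs into a mapping of caller -> set of callees."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
    return graph


def callers_of(graph, target):
    """Return every service that calls `target`, sorted by name."""
    return sorted(caller for caller, callees in graph.items() if target in callees)


# Hypothetical call records
calls = [
    ("gateway", "orders"),
    ("orders", "billing"),
    ("gateway", "billing"),
    ("billing", "orders"),
]
graph = build_call_graph(calls)
print(callers_of(graph, "billing"))
```

Tracing backends generally offer a service map built on the same idea, so this is mainly useful for ad hoc checks.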

https://redd.it/1iz4bk9
@r_devops
Sonatype Nexus OSS: Error during transaction commit and more DB errors

I am using Nexus version `3.70.1-02`, which is the last version that supports OrientDB. It is deployed on a k8s cluster as a pod. I have been facing multiple issues ever since I tried to fetch statistics about the sizes of the repositories hosted on Nexus. I opened a shell with `kubectl exec -it -u root <nexus-pod>` and executed the following commands:

java -jar /opt/sonatype/nexus/lib/support/nexus-orient-console.jar
> CONNECT PLOCAL:/nexus-data/db/component admin admin
> select bucket.repositoryname as repository,sum(size) as bytes from asset group by bucket.repositoryname order by bytes desc limit 10;

This command worked as expected, but ever since I have been facing various transaction errors while reading, writing, or even fetching metadata from various repos. I host APT, Docker, and raw repos on Nexus.

com.orientechnologies.orient.core.db.OPartitionedDatabasePool$DatabaseDocumentTxPooled - $ANSI{green {db=component}} Error on transaction commit 570FD604
com.orientechnologies.orient.core.exception.OStorageException: Error during transaction commit
DB name="component"

At first I suspected a permissions problem, because the persistent volume is on the host machine, so I ran chmod -R 775 <nexus-persistent-location> and chown 200:200 <nexus-persistent-location>, but this didn't solve the problem.

Every now and then I have to rebuild the indices with the REBUILD INDEX *; command and then delete the Nexus pod so that k8s creates a new one, which works for some time (4-7 hrs). Any clues as to what may be wrong here?

https://redd.it/1iz7rgk
@r_devops