Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Are we heading toward a new era for incidents?

Microsoft and Google report that about 30% of their code is now written by AI, and YC said that the startups in its last cohort had [95% of their codebases generated by AI](https://leaddev.com/hiring/95-ai-written-code-unpacking-the-y-combinator-ceos-developer-jobs-bombshell). While many here are sceptical of the vibe-coding trend, it looks like the future of programming. But little is discussed about what it means for the operations folks supporting this code.

Here is my theory:

* Developers can write more code, faster. Statistically, that means more production incidents.
* Batch sizes increase, making troubleshooting harder.
* Developers become helpless during an incident because they don't know their codebase well.
* The number of domain experts shrinks as developers become generalists who spend their time reviewing LLM suggestions.
* SRE team sizes shrink too, because the AI mandate is "do more with less."

Do you see this scenario playing out? How do you think SRE teams should prepare for this future?

Wrote about the topic in an article for LeadDev [https://leaddev.com/software-quality/ai-assisted-coding-incident-magnet](https://leaddev.com/software-quality/ai-assisted-coding-incident-magnet) – very curious to hear from y'all on the topic.


https://redd.it/1kry990
@r_devops
Cannot get GitHub Actions build to work with protoc

I've got a Rust build that needs access to protoc (the Protobuf compiler). I set it up like this:

```yaml
build-test-deploy:
  runs-on: ubuntu-latest

  # ...

    - name: Install protoc
      run: sudo apt-get update && sudo apt-get install -y protobuf-compiler

    - name: Test
      run: |
        which protoc
        export PROTOC=/usr/bin/protoc
```


In addition, env has

```yaml
env:
  AWS_REGION: "us-east-2"
  # ...
  PROTOC: "/usr/bin/protoc"
```


`which protoc` outputs the expected path: `/usr/bin/protoc`

Yet the build fails with this:

```
Error: Custom { kind: NotFound, error: "Could not find `protoc`. If `protoc` is installed, try setting the `PROTOC` environment variable to the path of the `protoc` binary. To install it on Debian, run `apt-get install protobuf-compiler`. It is also available at https://github.com/protocolbuffers/protobuf/releases  For more information: https://docs.rs/prost-build/#sourcing-protoc" }
```


I'm kind of at a loss...
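One thing worth checking (an assumption on my part, since the full workflow isn't shown): `export` inside a `run:` step only affects that step's shell, so it won't reach a later build step, and a workflow-level `env:` only helps if the `cargo build` actually runs as a step of that same job. A sketch of two ways to make `PROTOC` visible to the build step, using the real `$GITHUB_ENV` mechanism to persist it across steps:

```yaml
- name: Locate protoc
  # Appending to $GITHUB_ENV makes the variable available to all later steps.
  run: echo "PROTOC=$(which protoc)" >> "$GITHUB_ENV"

- name: Build
  # Alternatively, set it just for this one step:
  env:
    PROTOC: /usr/bin/protoc
  run: cargo build --release
```

Either way, the point is that the variable has to be in the environment of the process that runs `prost-build`, not merely exported in an earlier step's shell.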

https://redd.it/1ks3i2a
@r_devops
Can Gitlab’s native ‘Dependency Proxy for packages’ feature replace the need for Sonatype Nexus?

Based on a developer's feedback, there's a clear need for an internal binary repository within our network to serve as a secure, controlled intermediary for external dependencies. We currently have the following issues:

1. Manual downloading, scanning, and internal placement of dependencies is time-consuming.

2. Current development workflows are being hindered by lack of streamlined access to dependencies.

3. We have no way to externally source NPM packages and NuGet packages into our environment without going through a tedious manual process.

I was looking at Gitlab's documentation for the Dependency Proxy feature, but there is no clear example of a user proxying the kinds of packages I'm interested in, the way you would during a build with Nexus or JFrog. The YouTube videos around this feature are years old, by the way, with no examples of doing this. I think we need Nexus so we can scan the proxied packages for vulnerabilities, but I would like to save cost using any workaround in Gitlab (what we have) if that is possible.

This is part of an ongoing effort to modernize multiple applications (running them as containers in a VKS cluster), but it doesn't make sense to move on to this step if we have no central space for storing container images (I'm aware each project in Gitlab can store container images at the project level), binaries, externally sourced dependencies that have been scanned, and other artifacts.

https://redd.it/1ks8718
@r_devops
What do you wish someone told you when you became a DevOps engineer?

Hello all,

What do you wish you knew when you got started in DevOps?

A tool you saw someone use every day that you adopted, a monitoring platform you switched to later than you should have in hindsight, a solution to a problem you didn't know you had, etc.

I recently got promoted internally from Systems Administrator to DevOps (yay!). I have a background in Linux/cloud administration.

I've basically been doing both systems administration and DevOps for a couple years for my company. Which means I haven't been able to do either as well as I would like.

We're bringing on a SysAdmin this week and I was moved to DevOps. So now I will have the space to do this job properly.

Our stack is:

AWS:
- ECS (Fargate)
- S3
- GuardDuty
- EventBridge
- SNS
- Route 53
- Cognito
- ECR
- CloudWatch
- IAM

DB:
- MongoDB Atlas

Monitoring:
- New Relic


Some things I have already identified: I know we need to reduce our attack surface. I think we're leaving some things on the table with GitHub's automation (we already use GH, but there's more we could do with automatic tagging for issue tracking). I'm planning to create a web portal so my developers can turn dev tenants on and off as needed (ECS Fargate + Terraform + an authenticated web portal via Cognito with org SSO), and I'm planning to ramp up our underutilized New Relic implementation and CloudWatch.

https://redd.it/1ks8f35
@r_devops
How do you avoid CI and CD unsync when using GitOps workflow like FluxCD?

Imagine this situation: you push changes to the GitLab repo, and the docker build+push takes 5 minutes, while FluxCD checks the repo for changes every minute.

You merge a feature into main, starting the CI/CD workflow that deploys to the production K8s. The problem is that FluxCD, simply polling the repo every minute, triggers its deploy before the docker image build has finished pushing to the registry.

Is there a way to configure FluxCD to avoid this race condition between image build and deploy timings? Or should I make FluxCD deploy only a specific image hash, bumping it to the new image manually?
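One approach, sketched with Flux's image automation controllers (the resource names and registry path below are placeholders, not from the post): instead of racing the Git poll, let Flux watch the registry, so a tag is only rolled out once it actually exists:

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: myapp                 # placeholder
  namespace: flux-system
spec:
  image: registry.gitlab.com/acme/myapp   # placeholder
  interval: 1m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: myapp
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: myapp
  policy:
    semver:
      range: ">=1.0.0"        # newest tag that has finished pushing
```

Paired with an ImageUpdateAutomation that commits the chosen tag back to Git, the manifests always reference an image that is already in the registry. Even without any of this, Kubernetes does eventually recover from the race via ImagePullBackOff retries; it just looks ugly in the meantime.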

https://redd.it/1ks9ola
@r_devops
FREE GitHub Advanced Security Certification

Just wanted to share a great free opportunity from GitHub for anyone interested.

How it works:

Step 1: Complete 3 GitHub Skills courses (each ~1 hour)

Step 2: Submit the Completion Form
After finishing all three, fill out the official form to share your progress.
Deadline: May 31, 2025

Step 3: Take the Certification Exam
In June 2025, you'll receive a free voucher (worth $99) to take the GitHub Advanced Security Certification exam. If you pass, you'll earn an official GitHub certification to showcase your security skills!

I think this is a solid opportunity for anyone looking to boost their cybersecurity portfolio, especially if you're interested in DevSecOps.

Link:
https://maintainermonth.github.com/security-challenge

Don't forget to upvote :)

https://redd.it/1kscd08
@r_devops
Suggested resources for starting as a junior devops engineer

I’m starting as a junior devops engineer soon and was wondering if some people could point me to resources to help me get started. For background, I am currently a software engineer, but in the robotics/automation field, so the job I’m switching to is a role that will be relatively unfamiliar to me. I am good with Linux and Python but haven’t used AWS or Kubernetes, which are what I will be working with. There will be on-the-job training, but I don’t want to go in totally blind.

https://redd.it/1ksd89e
@r_devops
Where to apply for Internships and Jobs?

Hey, I am a student in my final year who has been exploring and learning DevOps, cloud, IaC, and development. I am currently applying for internships on the Internshala portal, but I lack some of the skills mentioned in the requirements, which I am working on right now.
I just wonder if anyone could recommend some of the best portals or sites to apply on.

https://redd.it/1ksdoo9
@r_devops
Am I capable of a junior DevOps Engineer role with this experience?

(Morphing personal info for safety.)

Experience:
DevOps Intern, company.ai – company networks project, January 2025 – present

• Implemented SigNoz for Kubernetes cluster monitoring, configured 30+ alerting mechanisms, and designed 5 types of dashboards for comprehensive metric visualization.

• Integrated Trivy (DevSecOps tool) with GitHub Actions, enabling automated security scans and identifying 15 high-severity vulnerabilities before deployment.

• Troubleshot Kubernetes clusters, leveraging ArgoCD and Helm charts with Horizontal Pod Autoscaling (HPA), resulting in a 25% improvement in deployment stability and optimized CI/CD pipeline efficiency

---

Software Engineer Intern, company2 project, June 2024 – August 2024

• Integrated NFT APIs with the frontend for dynamic asset displays, optimizing data retrieval, reducing redundant API calls by 70%, and improving API response times from 2-3s to 350ms.

• Configured Moralis and Infura for secure NFT transactions and blockchain interactions, achieving a 95% transaction success rate and reducing gas fees by 20% through smart contract execution (average execution time reduced from 4s to 2.5s)

---

Skills:

Java, Python, NodeJS, HTML5, CSS3, Linux, SQL, Docker, Kubernetes, Git, CI/CD, Azure Cloud, AWS, Grafana, Prometheus, Signoz

---

Projects

1. Fusion Linux - Linux Distribution for DevOps And Cloud Environments

• Automated ISO image creation and customization using live-build, Bash scripting and other configurations

• Implemented CI/CD pipelines (GitHub Actions/GitLab CI) for automated OS builds and testing, decreasing deployment time from 45 minutes to 20 minutes and improving build success rate to 98%.

• Enabled GPU passthrough for virtualized environments, improving computational performance by 90% for GPU-intensive workloads in virtual machines.

2. Infrastructure Monitoring and Vulnerability Scanning Suite | Signoz

• Monitoring solution using Signoz

• Configured 30+ custom alerting rules and developed 5 types of dashboards, improving system observability and reducing mean time to detect (MTTD) by 40%.

• Integrated Trivy for automated vulnerability scanning in containers and system packages, identifying 15+ high-severity vulnerabilities per scan and reducing security risks by 60%.

3. Cryptway | React Js, Rapid Api, Solidity, Ethereum, Vercel

• Developed a blockchain platform enabling users to create Ethereum wallets, send/receive Ethereum, and swap ERC-20 tokens, processing an average of 80+ transactions per day

• Migrated from Vercel to Azure Cloud for enhanced scalability and cost optimization, leveraging Azure Spot Instances to reduce infrastructure costs by 70% while maintaining performance.

---

Achievements

• 1st Prize at Mumbai Hacks Hackathon (World’s Largest Generative AI Hackathon)

• Smart India Hackathon Finalist 2024

• 1st Prize at AI Spark (Hackathon)

---

Certifications

• Microsoft Certified: Azure Fundamentals

• Microsoft Certified: Azure AI Fundamentals

• AZ-104

and preparing for CKA

https://redd.it/1ksk7li
@r_devops
AWS project

I would like to build an AWS project that would help me explore what I like and what I don't. I'm pretty new to public clouds, but I have experience with on-prem, so the learning curve isn't that steep. Someone suggested building something like an app for calling taxis. Does anyone have other project suggestions that would force me to not only write code, but also do infra, security, and data-management work?

https://redd.it/1ksl38t
@r_devops
What's your favorite lightweight monitoring stack?

Prometheus feels a bit heavy for small projects. Any go-to minimal setups you like?

https://redd.it/1ksm95r
@r_devops
What would be a better middleware solution or tool we can use?

We are looking for a middleware solution or tool that connects to an HTTP/WebSocket server hosted in our AWS cloud and continuously streams real-time event/log data.
This middleware is hosted in the client's cloud, has no public IP, and cannot be reached from outside. But it can reach our system, since ours is publicly accessible: it pulls data from our network.

So the problem is that we need a solution/tool that ensures all data is pulled, processed (yes, we need to process it and post it to another endpoint in the client network), and we also need monitoring so we can see the data coming in and being posted, for better visibility.

https://redd.it/1ksn1o5
@r_devops
Calling Cloud/Cybersecurity Pros: Help My Thesis on Zero Trust Architectures

Hi everyone,

I'm conducting academic research for my thesis on zero trust architectures in cloud security within large enterprises and I need your help!

If you work in cybersecurity or cloud security at a large enterprise, please consider taking a few minutes to complete my survey. Your insights are incredibly valuable for my data collection and your participation would be greatly appreciated.

https://forms.gle/pftNfoPTTDjrBbZf9

Thank you so much for your time and contribution!

https://redd.it/1ksmsew
@r_devops
Loki giving a "Get - deadline exceeded" error

I have a containerized Grafana monitoring stack with Grafana Alloy and Loki working over a tailnet. When I curl https://mytailnet/loki/ready it works and I get a 200 OK. However, when I try to POST to Loki, I get a 404 page not found, and the Loki docker logs contain:

```
caller=mock.go:150 msg=Get key=collectors/compactor wait_index=779
caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/scheduler
caller=mock.go:150 msg=Get key=collectors/scheduler wait_index=781
caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/ring
caller=mock.go:150 msg=Get key=collectors/ring wait_index=780
caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/distributor
```

Can anybody help?

My loki.yaml is

```yaml
auth_enabled: false  # Enable in production!

server:
  http_listen_address: 0.0.0.0  # e.g., 100.101.102.103
  http_listen_port: 3100
  grpc_listen_port: 9096
  http_server_idle_timeout: 40m
  http_server_read_timeout: 20m
  http_server_write_timeout: 20m
  log_level: debug

common:
  path_prefix: /loki-data
  storage:
    filesystem:
      chunks_directory: /loki-data/chunks
      rules_directory: /loki-data/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

limits_config:
  allow_structured_metadata: false

schema_config:
  configs:
    - from: 2025-05-16
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

#querier:
#  engine:
#    timeout: 15m
#  max_concurrent: 512
#  query_timeout: 5m

ingester:
  wal:
    enabled: true
    dir: /loki/wal

storage_config:
  tsdb_shipper:
    active_index_directory: /loki-data/tsdb-index
    cache_location: /loki-data/tsdb-cache
```
https://redd.it/1ksoh00
@r_devops
Why doesn't crt.sh show the latest Let's Encrypt cert under the base domain?

I noticed that when I query:
https://crt.sh/?q=DOMAIN.COM&exclude=expired&output=json
…it doesn’t include the latest certificate I just renewed via Let's Encrypt.

However, when I directly query the full subdomain, like:
https://crt.sh/?q=api.test.DOMAIN.COM&output=json
…the new cert (and its corresponding precertificate) appear immediately.

For example, the base domain query returns 4 entries, but the subdomain one returns 6 — the two extra entries are the new precert and the issued cert.

Is there a way to query the base domain and receive all subdomain certs (including the latest) without knowing every subdomain in advance?
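As far as I know, crt.sh treats `%` as a wildcard in the `q` parameter, so querying `%.DOMAIN.COM` (URL-encoded as `%25.DOMAIN.COM`) should return certs for all subdomains without enumerating them. A small helper that just builds the URL (treat the wildcard behaviour as something to verify against your own domain):

```python
from urllib.parse import urlencode

def crtsh_wildcard_url(domain: str, exclude_expired: bool = True) -> str:
    """Build a crt.sh JSON query URL covering all subdomains of `domain`."""
    params = {"q": f"%.{domain}", "output": "json"}
    if exclude_expired:
        params["exclude"] = "expired"
    return "https://crt.sh/?" + urlencode(params)

print(crtsh_wildcard_url("DOMAIN.COM"))
# → https://crt.sh/?q=%25.DOMAIN.COM&output=json&exclude=expired
```

Note the wildcard matches subdomains only, so you may still want a second query for the bare domain itself.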

https://redd.it/1kssjb1
@r_devops
I'm building an audit-ready logging layer for LLM apps, and I need your help!

**What?**

SDK to wrap your OpenAI/Claude/Grok/etc client; auto-masks PII/ePHI, hashes + chains each prompt/response and writes to an immutable ledger with evidence packs for auditors.

**Why?**

- HIPAA §164.312(b) now expects tamper-evident audit logs *and* redaction of PHI before storage.

- FINRA Notice 24-09 explicitly calls out “immutable AI-generated communications.”

- EU AI Act, Article 13, forces high-risk systems to provide traceability of every prompt/response pair.

Most LLM stacks were built for velocity, not evidence. If “show me an untampered history of every AI interaction” makes you sweat, you’re in my target user group.

**What I need from you**

Got horror stories about:

* masking latency blowing up your RPS?
* auditors frowning at “we keep logs in Splunk, trust us”?
* juggling WORM buckets, retention rules, or Bitcoin anchor scripts?

**DM me** (or drop a comment) with the mess you’re dealing with. I’m lining up a handful of design-partner shops - no hard sell, just want raw pain points.

https://redd.it/1ksuxb3
@r_devops
Rant - Companies are getting more and more entitled about job interviews

Did a quick recruiter screening Monday and a more technical interview on Tuesday, and it went well, so for the next "round" they sent me a 70-page document outlining an "assessment" they want me to do before going further.

Requires me to set up an AWS account and provision a bunch of resources that don't fall under the free tier. Wtf? I asked them if they could just create an account for me to use, or if I can just create a local environment that mimics the AWS stuff as close as possible, they said no because part of the evaluation is how familiar I am with AWS. Like ok I'm familiar but I'm not trying to pay for a job interview.

I read over most of the documentation, and the whole thing would conservatively take about 2 days to complete (accounting for, you know... my actual life). I could probably do it all in one day if I neglected all my other responsibilities.

They gave me a deadline of Tuesday "to give me some time over the weekend." Welp, Monday is a bank holiday, and my family and I planned a vacation months ago (technically decades ago, because we've been doing this same trip every year since I was a baby). We fly out early tomorrow morning and come back Monday night, and today is mostly running last-minute errands and driving about 3 hours to my cousin's house for the night, because they live 20 minutes from the airport, our flight is at 6 a.m., and we're all on the same flight.

I got this assignment today at 10am.

I emailed them and politely explained the situation and that it's not going to work for me. Haven't heard back yet but I'm probably just gonna tell them I'm not interested anymore. This job market is exhausting.

https://redd.it/1kswd6v
@r_devops
When the CI pipeline breaks and the team asks, Did you change anything?

You could’ve sworn you didn’t touch anything. You check the logs. Nope. The error message just says, “undefined something something.” You sit there, staring at the screen like a confused raccoon in headlights. Meanwhile, your coworkers are asking if you broke it. Spoiler: You didn’t, but now it’s your problem. Welcome to DevOps!

https://redd.it/1ksww8t
@r_devops
To Flag or Not to Flag? — Second-guessing the feature-flag hype after a month of vendor deep-dives

Hey r/devops (and any friendly lurkers from r/programming & r/softwarearchitecture),

I just finished a (supposed-to-be) quick spike for my team: evaluate which feature-flag/remote-config platform we should standardise on. I kicked the tyres on:

- LaunchDarkly
- Unleash (self-hosted)
- Flagsmith
- ConfigCat
- [Split.io](https://Split.io)
- Statsig
- Firebase Remote Config (for our mobile crew)
- AWS AppConfig (because… AWS 🤷‍♂️)

# What I love

- Kill-switches instead of 3 a.m. hot-fixes
- Gradual rollouts / A–B testing baked in
- “Turn it on for the marketing team only” sanity
- Potential to separate deploy from release (ship dark code, flip later)

# Where my paranoia kicks in



|Pain point|Why I’m twitchy|
|:-|:-|
|Dashboards ≠ Git|We’re a Git-first shop: every change—infra, app code, even docs—flows through PRs. Our CI/CD pipelines run 24×7 and every merge fires audits, tests, and notifications.   Vendor UIs bypass that flow.  You can flip a flag at 5 p.m. Friday and it never shows up in git log or triggers the pipeline.  Now we have two sources of truth, two audit trails, and zero blame granularity.|
|Environment drift|Staging flags copied to prod flags = two diverging JSONs nobody notices until Friday deploy.|
|UI toggles can create untested combos|QA ran “A on + B off”; PM flips B on in prod → unknown state.|
|Write-scope API tokens in every CI job|A leaked token could flip prod for every customer. (LD & friends recommend SDK_KEY everywhere.)|
|Latency & data residency|Some vendors evaluate in the client library, some round-trip to their edge. EU lawyers glare at US PoPs. (DPO = Data Protection Officer, our internal privacy watchdog.)|
|Stale flag debt|Incumbent tools warn, but cleanup is still manual diff-hunting in code. (Zombie flags, anyone?)|
|Rich config is “JSON strings”|Vendors technically let you return arbitrary JSON blobs, but they store it as a string field in the UI—no schema validation, no type safety, and big blobs bloat mobile bundles. Each dev has to parse & validate by hand.|
|No dynamic code|Need a 10-line rule? Either deploy a separate Cloudflare Worker or bake logic into every SDK.|
|Pricing surprises|“$0.20 per 1 M requests” looks cheap—until 1 M rps on Black Friday. Seat-based plans = licence math hell.|


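For what it's worth, the Git-first alternative the table hints at can start very small: a flag file in the repo, typed defaults in code, every change a reviewed commit. A toy sketch (the flag names, file shape, and bucketing rule are all made up, not any real tool's format):

```python
import json

# flags.json lives in the repo, so every change is a reviewed, audited commit.
FLAGS_JSON = """
{
  "new-checkout": {"enabled": true, "rollout_percent": 25},
  "dark-mode":    {"enabled": false}
}
"""

def flag_enabled(flags: dict, name: str, user_id: int, default: bool = False) -> bool:
    """Evaluate a flag with a deterministic percentage rollout."""
    flag = flags.get(name)
    if flag is None:
        return default                    # unknown flag -> typed default, no crash
    if not flag.get("enabled", False):
        return False
    percent = flag.get("rollout_percent", 100)
    return (user_id % 100) < percent      # stable per-user bucketing

flags = json.loads(FLAGS_JSON)
print(flag_enabled(flags, "new-checkout", user_id=7))   # user 7 is in the 25% bucket
print(flag_enabled(flags, "dark-mode", user_id=7))
```

This trades away the nice dashboards and SDK edge evaluation, but every flip shows up in `git log`, goes through CI, and can't drift between environments behind your back.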

# Am I over-paranoid?

- Are these pain points legit show-stoppers, or just “paper cuts you learn to live with”?
- How do you folks handle drift + audit + cleanup in the real world?
- Anyone moved from dashboard-centric flags to a Git-ops workflow (e.g., custom tool, OpenFeature, home-grown YAML)? Regrets?
- For the EU crowd: did your DPO actually care where flag evaluation happens?

Would love any war stories or “stop worrying and ship the darn flags” pep talks.



Thanks in advance—my team is waiting on a recommendation and I’m stuck between 🚢 and 🛑.

https://redd.it/1kszbs2
@r_devops
Is DevOps ADHD-friendly work?

I am a PHP developer, and I recently found out that I don't do well having to answer to calls for 2-3 teams. I also get stressed and feel interrogated during code reviews. I suspect ADHD, and I am considering a career shift (but I'm not yet fully committed).

In my personal projects I've noticed that I focus on automation and developing release procedures rather than on the actual implementation of code. So I'm looking at DevOps, but the main problem is the same: I don't do well with communication, especially on small teams.

So I wonder: is this a setback in DevOps? Most positions are either Cloud Engineer or SRE or a combination with DevOps, and they require an on-call rotation schedule, so I don't know whether it would be a better choice for me.

What do you recommend?

https://redd.it/1kt0n33
@r_devops
Next.js deployment with CDKTF

Hi everyone!
I've decided to make a "mega" project starter, and I've got stuck on the deployment configuration.

I'm using Terraform CDK to create deployment scripts for AWS, GCP and Azure for a Next.js static site.

Can somebody give some advice or a review: am I doing it right, or missing something important?

Currently I'm surprised that GCP requires a CDN for routing, and that it's not possible to generate tfstate from existing infra.
I also can't work out how to share tfstate without committing it to git, which is insecure.
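For sharing state without committing it, a remote backend is the usual answer. A minimal sketch with CDKTF's built-in `S3Backend` (the bucket and table names are placeholders, not from the repo; GCS and Azure backends exist analogously):

```typescript
import { App, TerraformStack, S3Backend } from "cdktf";
import { Construct } from "constructs";

class SiteStack extends TerraformStack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // State lives in S3 instead of the repo; the DynamoDB table adds locking.
    new S3Backend(this, {
      bucket: "my-tfstate-bucket",          // placeholder name
      key: "md-starter/terraform.tfstate",
      region: "us-east-1",
      dynamodbTable: "tfstate-locks",       // placeholder name
      encrypt: true,
    });

    // ...site resources go here...
  }
}

const app = new App();
new SiteStack(app, "site");
app.synth();
```

With this, `cdktf deploy` reads and writes state remotely, so the whole team shares one state file and nothing sensitive lands in git.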

Here is my [repo](https://github.com/DrBoria/md-starter); the infrastructure stuff lives [here](https://github.com/DrBoria/md-starter/tree/master/apps/infrastructure).

It should work if you just follow the steps from the readme.

Thanks a lot!

https://redd.it/1kt3lg7
@r_devops