Reddit DevOps
Is 2025 CKA harder than it was before? (Rant)

I waited to post this for a few months.

For context, I started my Kubernetes journey fresh in September 2024 with minimal experience (only Docker and docker-compose, no orchestration, though I do have sysadmin/DevOps experience). I went through the whole KodeKloud course, did all 70+ killercoda scenarios and scored 80% on my killer.sh attempt. I probably spent 120+ hours studying and practicing for this exam.

I took the updated exam on 1st of March 2025, so I knew about the updates and went over the additional material as well. I took multiple KodeKloud mock exams, with mixed results. But I had read a lot about how killer.sh is much harder than the real CKA exam, so when I scored 80% on my practice attempt I was pretty confident going into the exam (maybe I was just lucky that the killer.sh questions suited me).

When I started the exam, oh boy: flagged 1st, flagged 2nd, flagged 3rd... I think the first question I started solving was the 7th or 8th. I should have written down exactly what I struggled with, but I felt it was much harder than killer.sh. I think I can navigate the K8s docs pretty well, and I know I had some Gateway API questions, but I feel the docs were non-existent for my questions. Then also, why use Helm and not allow the Helm docs? I remember I had to install and configure a CNI, but why wouldn't you allow the docs/GitHub for it? Does every Certified Kubernetes Admin know this off the top of their head? Even when there is an update? I know there were some things, such as resource limits on the nodes, that I could have studied better for.

So after 2 hours, I scored 45% (probably better than 60-65%, since then I would be angrier at myself, though also more confident for the retake).

So I wanted to ask anyone who took the exam before and retook it after the February update: was the exam harder? Or am I just stupid?

By the end of this month I want to start revising again and do the retake in July/August. Do you guys have any resources other than KodeKloud, killercoda and killer.sh? I'm buying a Hetzner VPS and going to host something in K8s to get more real-life experience.

End of my rant.

Edit: I'm not a time traveller, fixed


https://redd.it/1kkv3ua
@r_devops
MacBook or Mac Mini for DevOps?

Basically what the title says. I'm currently working as a DevOps Engineer and looking for a laptop or desktop that is stable and smooth for personal use. I want to know whether a MacBook Air or a Mac Mini is worth it and will last. I'd also appreciate any suggestions other than these, with specs :)

https://redd.it/1kkycog
@r_devops
The first time I ran terraform destroy in the wrong workspace… was also the last 😅

Early Terraform days were rough. I didn’t really understand workspaces, so everything lived in default. One day, I switched projects and, thinking I was being “clean,” I ran terraform destroy.

Turns out I was still in the shared dev workspace. Goodbye, networking. Goodbye, EC2. Goodbye, 2 hours of my life restoring what I’d nuked.

Now I’m strict about:

Naming workspaces clearly
Adding safeguards in CLI scripts
Using terraform plan like it’s gospel
And never trusting myself at 5 PM on a Friday
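
A minimal sketch of the kind of CLI safeguard the list above mentions, assuming a small wrapper sits in front of terraform destroy. The protected workspace names and the function names are hypothetical; `terraform workspace show` is the real command that prints the active workspace.

```python
import subprocess

# Hypothetical set of workspaces that should never be destroyed casually.
PROTECTED_WORKSPACES = {"default", "dev", "prod"}


def current_workspace() -> str:
    """Return the active workspace as printed by `terraform workspace show`."""
    result = subprocess.run(
        ["terraform", "workspace", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def destroy_allowed(workspace: str, typed_confirmation: str) -> bool:
    """Allow destroy outside protected workspaces, or when the operator
    has typed the protected workspace's name back exactly."""
    if workspace not in PROTECTED_WORKSPACES:
        return True
    return typed_confirmation == workspace


def guarded_destroy() -> None:
    """Ask for explicit confirmation before destroying a protected workspace."""
    workspace = current_workspace()
    typed = ""
    if workspace in PROTECTED_WORKSPACES:
        typed = input(f"Workspace '{workspace}' is protected. Type its name to continue: ")
    if not destroy_allowed(workspace, typed):
        raise SystemExit("Refusing to run terraform destroy.")
    subprocess.run(["terraform", "destroy"], check=True)
```

The decision lives in `destroy_allowed` so it can be checked without Terraform installed; `guarded_destroy` is the only part that shells out.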

Funny how one command can teach you the entire philosophy of infrastructure discipline.

Anyone else learned Terraform the hard way?

https://redd.it/1kkzo2h
@r_devops
Discussion: Model-level scaling for Triton Inference Server

Hey folks, hope you’re all doing great!

I ran into an interesting scaling challenge today and wanted to get some thoughts. We’re currently running an ASG (g5.xlarge) setup hosting Triton Inference Server, using S3 as the model repository.

The issue is that when we want to scale up a specific model (due to increased load), we end up scaling the entire ASG, even though the demand is only for that one model. Obviously, that’s not very efficient.

So I’m exploring whether it’s feasible to move this setup to Kubernetes and use KEDA (Kubernetes Event-driven Autoscaling) to autoscale based on Triton server metrics — ideally in a way that allows scaling at a model level instead of scaling the whole deployment.

Has anyone here tried something similar with KEDA + Triton? Is there a way to tap into per-model metrics exposed by Triton (maybe via Prometheus) and use that as a KEDA trigger?
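
One way this could look, as a sketch rather than a tested setup: KEDA scales workloads, not models inside a single server, so this assumes each model gets its own Triton Deployment (the Deployment name, Prometheus address and threshold below are all hypothetical). The query uses `nv_inference_request_success`, a counter Triton exposes with a `model` label; check the exact metric names against your own /metrics output.

```python
import json


def per_model_query(model: str, window: str = "1m") -> str:
    """PromQL for the per-second rate of successful inferences of one model."""
    return f'sum(rate(nv_inference_request_success{{model="{model}"}}[{window}]))'


def scaled_object(model: str, deployment: str, prometheus_url: str,
                  threshold: int, max_replicas: int) -> dict:
    """Build a KEDA ScaledObject that scales one Deployment on one model's load."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 1,
            "maxReplicaCount": max_replicas,
            "triggers": [{
                "type": "prometheus",
                "metadata": {
                    "serverAddress": prometheus_url,
                    "query": per_model_query(model),
                    # KEDA expects the threshold as a string.
                    "threshold": str(threshold),
                },
            }],
        },
    }


def to_manifest(obj: dict) -> str:
    """Serialize to JSON, which kubectl accepts in place of YAML."""
    return json.dumps(obj, indent=2)
```

For example, `to_manifest(scaled_object("resnet50", "triton-resnet50", "http://prometheus.monitoring:9090", 50, 5))` produces a manifest that asks for roughly one replica per 50 requests per second of that model.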

Appreciate any input or guidance!


https://redd.it/1kl1ctu
@r_devops
Help/suggestions needed USA - DevOps Engineer Interview

Hello all,

I recently applied to a company; its job description is below. I am familiar with many of the concepts, but somehow I am worried about the interview. I got a screening call and am awaiting a response.

Can anyone please help with suggestions on where to focus more, what questions to expect, and any other tips?

Thanks in advance

**Required Skills:**

* **3+ years work experience in a DevOps or similar role**
* **Fluency in one or more scripting languages such as Python or Ruby**
* **In-depth, hands-on experience with Linux, networking, server, and cloud architectures**
* **Experience in configuration management technologies such as Chef, Puppet or Ansible**
* **Experience with AWS or another cloud PaaS provider**
* **Understanding of fundamental network technologies like DNS, Load Balancing, SSL, TCP/IP, SQL, HTTP**
* **Solid understanding of configuration, deployment, management and maintenance of large cloud-hosted systems; including auto-scaling, monitoring, performance tuning, troubleshooting, and disaster recovery**
* **Proficiency with source control, continuous integration, and testing pipelines**
* **Championing a culture and work environment that promotes diversity and inclusion**
* **Participate in the team’s on-call rotation to address complex problems in real-time and keep services operational and highly available**



**Preferred Skills:**

* **Experience with Containers and orchestration services like Kubernetes, Docker etc.**
* **Familiarity with Go**
* **Understand cloud security and best practices**

https://redd.it/1kl3lgg
@r_devops
Is current state of querying on observability data broken?

Hey folks! I’m a maintainer at [SigNoz](https://signoz.io), an open-source observability platform

Looking to get some feedback on my observations on querying for o11y and if this resonates with more folks here

I feel that current observability tooling significantly lags behind user expectations by failing to support a critical capability: querying across different telemetry signals.

This limitation turns what should be powerful correlation capabilities into mere “correlation theater”, a superficial simulation of insights rather than true analytical power.

Here are the current gaps I see:

1/ Suppose I want to retrieve logs from the host which has the highest CPU usage in the last 13 minutes. It’s not possible to query this seamlessly today: you have to query the metrics first, paste the results into the logs query builder, and then retrieve your results. Seamless correlation across signals is nearly impossible today.

2/ COUNT DISTINCT on multiple columns is not possible today. Most platforms let you perform a count distinct on one column, say count unique of source OR count unique of host OR count unique of service, etc. Adding multiple dimensions and drilling down deeper is also a serious pain point.
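
To make the second gap concrete, here is a small in-memory sketch of a count distinct over a combination of columns. The log records and field names are made up for illustration and say nothing about how any platform stores its data.

```python
# Hypothetical log records; field names are assumptions for illustration.
logs = [
    {"source": "nginx", "host": "web-1", "service": "checkout"},
    {"source": "nginx", "host": "web-1", "service": "checkout"},
    {"source": "nginx", "host": "web-2", "service": "checkout"},
    {"source": "app", "host": "web-2", "service": "cart"},
]


def count_distinct(rows, columns):
    """Number of distinct value combinations across the given columns."""
    return len({tuple(row[c] for c in columns) for row in rows})


single = count_distinct(logs, ["source"])            # one dimension
combined = count_distinct(logs, ["source", "host"])  # two dimensions together
```

Counting `source` and `host` separately gives 2 each here, while the combination gives 3, which is the drill-down that a single-column count distinct cannot answer.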

And some points on how we at SigNoz think these gaps can be addressed:

1/ Sub-query support: The ability to use the results of one query as input to another, mainly for getting filtered output

2/ Cross-signal joins: Support for joining data across different telemetry signals, for seeing signals side by side, along with a few other things.
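
The sub-query idea can be sketched in a few lines, assuming metrics and logs are both available as plain records. The hosts, values and field names are hypothetical, and this is not SigNoz's query language, just the shape of "inner query feeds outer query".

```python
# Hypothetical samples from the same time window.
cpu_metrics = [
    {"host": "web-1", "cpu_percent": 41.0},
    {"host": "web-2", "cpu_percent": 93.5},
    {"host": "web-1", "cpu_percent": 47.0},
    {"host": "web-2", "cpu_percent": 88.0},
]
logs = [
    {"host": "web-1", "message": "GET /health 200"},
    {"host": "web-2", "message": "upstream timed out"},
    {"host": "web-2", "message": "worker restarted"},
]


def busiest_host(metrics):
    """Inner query: the host with the highest average CPU in the window."""
    samples = {}
    for sample in metrics:
        samples.setdefault(sample["host"], []).append(sample["cpu_percent"])
    return max(samples, key=lambda h: sum(samples[h]) / len(samples[h]))


def logs_for_busiest_host(metrics, log_rows):
    """Outer query: filter the logs by the result of the inner query."""
    host = busiest_host(metrics)
    return [row["message"] for row in log_rows if row["host"] == host]
```

Today the hand-off between the two functions is the manual copy-paste step; a sub-query would let the platform do it in one request.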

Early thoughts are in [this blog](https://signoz.io/blog/observability-requires-querying-across-signals/). What do you think? Does it resonate, or does it seem like a use case not many people have?

https://redd.it/1kl6rhd
@r_devops
Is KodeKloud worth it?

I’ve been lurking here for a while after getting handed a bunch of DevOps tasks at work, and wanted to see if KodeKloud is a good resource for getting up to speed with Docker, Ansible, Terraform and concepts like networking, SSL, etc. Really enjoying this stuff, but I'm finding out how much I don’t know by the day.

https://redd.it/1kl9qop
@r_devops
What should I do?

Hello Everyone,

Long-time lurker, but now I’m asking questions. I’ve been in DevOps for coming up on 5 years, and I’m trying to figure out whether it’s time for a new AWS cert (Solutions Architect - Professional) or whether I should finally use my cybersecurity degree and get the AWS Certified Security - Specialty or another high-level security cert. I want to increase my $120k salary to something closer to $160k - $180k, and I don’t want to go down in salary. What should I do?

https://redd.it/1kl9ldv
@r_devops
How to handle buildkit pods efficiently?

So we have like 20-25 services that we build. They are multi-arch builds, and we use GitLab. Some of the services involve AI libraries, so they end up with stupidly large images, like 8-14GB. Most of the rest are far more reasonable. For these large ones, cache is the key to a fast build, and the cache being local is pretty impactful as well. That led us to using long-running pods and letting the Kubernetes driver for buildx distribute the builds.



So I was thinking: instead of, say, 10 buildkit pods with a 15GB memory limit and a max-parallelism of 3, maybe use bigger pods (like 60GB or so), fewer pods in total, and a higher max-parallelism. That way there is more local cache sharing.



But I am worried about OOMKills. And I realized I don't really know how buildkit manages memory. It can't know how much memory a task will need before it starts, and the memory use of different tasks (even for the same service) can be drastically different. So how is it not regularly getting OOMKilled because it happened to run more than one memory-heavy task at the same time on a pod? And would going to bigger pods increase or decrease the chance of an unlucky combo of tasks running at the same time and using all the memory?

https://redd.it/1klboue
@r_devops
Check out our blog post about AI SRE

https://www.icosic.com/blog/what-is-an-ai-sre

In this post we define the AI SRE and we outline its advantages and compare it to human SREs.

Thanks in advance for reading!

https://redd.it/1klbgjc
@r_devops
Every K8s Beginner’s Safety Net: --dry-run Explained in 5 Mins

Hey there! So far in our 60-Day ReadList series, we’ve explored Docker in depth and kick-started our Kubernetes journey, from Why K8s to Pods and Deployments.

Now, before you accidentally crash your cluster with a broken YAML… Meet your new best friend: --dry-run

This powerful little flag helps you:
- Preview your YAML
- Validate your syntax
- Generate resource templates
… all without touching your live cluster.
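
A short sketch of the two most common uses, written as helpers that only build the kubectl command lines (the function names are hypothetical; the flags are standard kubectl). Note that `--dry-run=client` stays local, while `--dry-run=server` does send the request to the API server for validation without persisting anything.

```python
def dry_run_apply(manifest_path: str, server_side: bool = False) -> list[str]:
    """Build a kubectl command that validates a manifest without persisting it."""
    mode = "server" if server_side else "client"
    return ["kubectl", "apply", "-f", manifest_path, f"--dry-run={mode}"]


def generate_deployment_template(name: str, image: str) -> list[str]:
    """Build a kubectl command that prints a Deployment manifest as YAML
    instead of creating the Deployment."""
    return ["kubectl", "create", "deployment", name,
            f"--image={image}", "--dry-run=client", "-o", "yaml"]
```

Either list can be handed to `subprocess.run(...)`; redirecting the second command's output to a file gives a starting template to edit.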

Whether you’re just starting out or refining your workflow, --dry-run is your safety net. Don’t apply it until you dry-run it!

Read here: Why Every K8s Dev Should Use --dry-run Before Applying Anything

Catch the whole 60-Day Docker + K8s series here. From dry-runs to RBAC, taints to TLS, Check out the whole journey.

https://redd.it/1kleghk
@r_devops
Is Linux foundation overcharging their certifications?

I remember CKA cost 150 dollars. Now it is 600+. Fcking atrocious Linux

https://redd.it/1klf2ab
@r_devops
How to QA Without Slowing Down Dev Velocity:

At my work (BetterQA), we use a model that balances speed with sanity - we call it "spec → test → validate → automate."

- Specs are reviewed by QA before dev touches them.

- Tests are written during dev, so we’re not waiting around.

- Post-merge, we do a run with real data, not just mocks.

- Then we automate the most stable flows, so we don’t redo grunt work every sprint.

It’s kept our delivery velocity steady without throwing half-baked features into production.

How do you work with your QA?

https://redd.it/1klgayv
@r_devops
Where are people using AI in DevOps today? I can't find real value

Two recent experiments highlight serious risks when AI tools modify Kubernetes infrastructure and Helm configurations without human oversight. Using kubectl-ai to apply “suggested” changes in a staging cluster led to unexpected pod failures, cost spikes, and hidden configuration drift that made rollbacks a nightmare. Attempts to auto-generate complex Helm values.yaml files resulted in hallucinated keys and misconfigurations, costing more time to debug than manually editing a 3,000-line file.

I ran

kubectl ai apply --context=staging --suggest

and watched it adjust CPU and memory limits, replace container images, and tweak our HorizontalPodAutoscaler settings without producing a diff or requiring human approval. In staging, that caused pods to crash under simulated load, inflated our cloud bill overnight, and masked configuration drift until rollback became a multi-hour firefight. Even with the debug changes, it's overriding my changes made through ArgoCD, which then get reverted. I feel the concept is nice, but in practice... it needs the full context or it will never be useful. The tool feels like we are just throwing pasta against the wall.

Another example is when I used AI models to scaffold a complex Helm values.yaml. The output ignored our chart’s schema and invented arbitrary keys like imagePullPolicy: AlwaysFalse and resourceQuotas.cpu: high. Static analysis tools flagged dozens of invalid or missing fields before deployment, and I spent more time tracing Kubernetes errors caused by those bogus keys than I would have spent manually editing our 3,000-line values file.
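
A minimal sketch of the kind of check that catches these invented keys before anything is deployed. The allowed top-level keys are hypothetical and hand-written here; a real chart would normally carry such rules in its own values.schema.json, which Helm validates against. The three pull policies are the ones Kubernetes accepts.

```python
# Hypothetical rules for illustration; not taken from any real chart.
ALLOWED_TOP_LEVEL_KEYS = {"image", "resources", "replicaCount"}
# The only values Kubernetes accepts for an image pull policy.
ALLOWED_PULL_POLICIES = {"Always", "IfNotPresent", "Never"}


def find_problems(values: dict) -> list[str]:
    """Return human-readable problems found in a parsed values mapping."""
    problems = []
    for key in values:
        if key not in ALLOWED_TOP_LEVEL_KEYS:
            problems.append(f"unknown key: {key}")
    policy = values.get("image", {}).get("pullPolicy")
    if policy is not None and policy not in ALLOWED_PULL_POLICIES:
        problems.append(f"invalid image.pullPolicy: {policy}")
    return problems


# Values shaped like the hallucinated output described above.
generated = {
    "image": {"pullPolicy": "AlwaysFalse"},
    "resourceQuotas": {"cpu": "high"},
}
problems = find_problems(generated)
```

Running a check like this on generated values in CI means the bogus keys fail fast instead of surfacing later as Kubernetes errors.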

Has anyone else captured any real, measurable benefits (faster rollouts or fewer human errors) without giving up control or visibility? Please share your honest war stories.

https://redd.it/1klgx3h
@r_devops
How to know if I'm suitable for an SRE/DevOps position

Hi folks

I've been a SWE for about 4 years now, and I'd consider myself a bit of a polyglot (fluent in lots of languages, front end to back end), and I've done a fair amount of work on the cloud and infrastructure side.

I'm curious if Reddit thinks I'd be capable of taking a job as an SRE or in DevOps based on my experience:
- Built and managed several Kubernetes clusters (no managed services)
- Built a multi-region, multi-vendor automated Kubernetes cluster deployer
- Worked with GitLab CI/CD to support releases for Spring Boot apps, various Node projects and more
- Built and maintained image scanning pipelines (using Trivy and Black Duck)
- Managed Terraform and Ansible projects for deploying infrastructure in AWS (including all your usual suspects: EC2, RDS, etc.)

Thanks!

https://redd.it/1klii7h
@r_devops
Self-hosted MySQL for production - how hard is it really?

I started software engineering in 2002. There was no cloud back then, so we would buy physical servers, rent a partial rack in a datacenter, deploy the servers there and install everything manually, from the OS to the database.

With 10-15 servers we quickly needed someone full time to manage the OS upgrades, patches, etc.

I have a side project that's getting hit around 5,000 times per minute uncached. Behind the back end sits a MySQL 8 database currently managed by DigitalOcean. I'm paying around $100 per month for the database, with 4 GB of RAM, 2 vCPUs and around 8 GB of disk.

Separately, I've been a customer of OVH since 2008 and I've never had real problems with them. For $90 per month I can have something stupidly better: AMD Ryzen 5 5600X 6c @ 3.7GHz/4.6GHz, 64GB of DDR4 RAM (can get 192GB for only $50 extra), 2x 960GB NVMe SSD in RAID, 25Gbps unmetered private bandwidth.
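
For scale, the numbers above work out as follows. This is only back-of-the-envelope arithmetic on the figures quoted in the post; RAM per dollar is a crude yardstick and ignores what the managed service does for you.

```python
def per_second(hits_per_minute: float) -> float:
    """Convert a per-minute request rate to requests per second."""
    return hits_per_minute / 60


def dollars_per_gb_ram(monthly_price: float, ram_gb: float) -> float:
    """Monthly price divided by the amount of RAM, in dollars per GB."""
    return monthly_price / ram_gb


load = per_second(5000)                # about 83 requests per second
managed = dollars_per_gb_ram(100, 4)   # DigitalOcean figures from the post
bare_metal = dollars_per_gb_ram(90, 64)  # OVH figures from the post
```

That is roughly 83 requests per second, and $25 per GB of RAM managed versus about $1.41 per GB on the dedicated box.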

My question: do any of you have practical, present-day experience of the work involved in keeping a database always updated/upgraded? Is it worth the hassle? What tools / stack do you use for this?

Note: I'm not affiliated with either OVH or DigitalOcean; the question is really about bare-metal self-managed (OVH, Hetzner, etc.) vs cloud managed (AWS, DigitalOcean, Linode, etc.)

https://redd.it/1kljcuz
@r_devops
What’s one cloud concept you pretended to understand at first?

Let’s be real—cloud has a steep learning curve. In my first few months, I nodded along when people mentioned VPCs, but deep down I had no clue what was really happening under the hood.

I eventually had to swallow my pride, go back to basics, and sketch it all out on paper. It finally clicked, but man—I struggled before that 😅

What about you?
Was there a concept (IAM, subnets, container orchestration?) you “faked till you made it”?
Curious what tripped others up early on.

https://redd.it/1klk7qt
@r_devops
BPMN for DevOps?

I'm looking into using a BPMN tool (like Camunda) or engine (like Zeebe or something more OSS) to describe complex DevSecOps processes, and would love to pick your brains on this topic.

I'm somewhat surprised that BPMN is not the standard, and that instead even the best tools only support DAGs, or are just super dev-friendly (e.g. Temporal). Have you used BPMN for DevOps automation/orchestration?

My idea is to keep using GitLab CI for ... well ... CI, but that would end at building containers. Otherwise all the orchestration, including cross-project orchestration and integrating several tools (Datadog, Slack, etc.), would happen at the BPMN layer. (I'm still thinking of using either a GitLab or a Kubernetes Job when I need a longer-running task, like a DB migration, but even that would be launched as part of BPMN.)

While I struggle to find people using BPMN for these tasks, I see more and more people using durable execution engines (e.g. Temporal) for it. If you were part of such a decision, would you mind sharing why you went one way or the other?

https://redd.it/1kllbaa
@r_devops
im finally a DevOps Engineer

5 years ago I had zero college, zero experience, no certifications, and no marketable skills coming out of the army. i set the goal for myself to become a DevOps engineer and today I did it.

got into IT with zero experience and one certification in 2020 when i got out of the army infantry.

first job was help desk, then sysadmin, then a couple tier 2/3 remote support positions including as a RHCSA at red hat. then i got a sysadmin position for my current company in August of 2023.

i worked my ass off. i have built full terraform/Terragrunt modules, deployment pipelines, and incident response tools for our clients, who are some of the biggest tech organizations in the world. google, zoom, red hat, Microsoft, etc... I do this across multiple cloud providers based on client needs. it's actually kind of shocking the amount of work we do at the level we do given the size of our team. I'm the only systems person and I get to touch infrastructure for large organizations on a regular basis.

today i got the email that i have officially been promoted to DevOps engineer.

im really proud of myself. I barely graduated high school because of my ADHD. I did well in the army but the violent environment was not good for my soul. college is very uncomfortable for me. I wasn't sure if I'd ever make a good living, let alone doing smart people stuff.

when I was getting into IT I looked for the most lucrative positions. then looked for the one that I thought seemed the most interesting and that was DevOps. now im a DevOps engineer.

I'm really proud of myself.



https://redd.it/1klp28x
@r_devops
Devops positions are harsh for mid-level

Hey buddies,

I have been in DevOps for 2 years, and in the tech industry for roughly 3 years. I am not a senior yet, more of a mid-level working in a good company here in Cyprus, but the thing is I'm not getting what I want. I mean, I'm trying to switch jobs like any normal human being looking for a change, and my current company is pretty reputable and known in the market. I have 2 AWS certifications and the CKA, and my CV is a solid 99/100 on ATS reviewers. But I'm still not getting in. All positions are looking for seniors, and this is killing me.
I mean, I am doing super well in interviews, always showing a super nice energy and answering all technical questions with the best answers possible. I did more than 15 interviews this year, and even reached the last stages with big companies like AWS, Exness... stuff like that, but bad luck is a curse. Always someone more experienced takes the role, or it gets filled internally, or the recruiter is a jerk... Any tips?

https://redd.it/1klulbg
@r_devops
How did your "trial by fire" go?

Hey! I'm in my first DevOps gig and it's kicking my butt. I was told that our environment is pretty complicated. We have a pretty intricate project pipeline with tons of jobs, rules, and variables. I'm having a hard time keeping up. I'm in year one and most of the tech we are using is new to me. It's making me want to quit, but there are pretty smart and PATIENT people who are taking me under their wing a bit. I don't want to disappoint them. And I'll admit, at this point it isn't interesting work to me, but I feel like that's only because I haven't got a firm grasp on it yet. I've been a sys engineer for 20 years and I feel like I've started at the bottom again.

What was your trial by fire like?

https://redd.it/1klxi7a
@r_devops