Reddit DevOps
266 subscribers
30.9K links
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Do you have a list of project topics for POC-ing?

I would say that there are two types of PoC projects: super small ones, where you just write "Hello World" to a console, and slightly bigger ones, where you want a real topic behind the code.

For example, if I need a web service of some sort, my go-to project is a pizza selector. Developers can have a list of pizzas available, and users can randomly select which pizza they want to order next time. I've used that a couple of times already and it's getting old :)

Do you have a similar type of project that is funny, somewhat useful and can be easily implemented/explained?
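For what it's worth, the core of the pizza selector fits in a few lines, which is part of what makes it a good PoC topic. A minimal Python sketch (menu contents are made up):

```python
import random

# A hypothetical menu; in a real PoC this would come from a config file or DB.
MENU = ["Margherita", "Quattro Formaggi", "Diavola", "Capricciosa", "Hawaiian"]

def pick_pizza(menu=MENU, seed=None):
    """Randomly select the pizza to order next time."""
    rng = random.Random(seed)  # seedable for reproducible demos
    return rng.choice(menu)

if __name__ == "__main__":
    print(f"Tonight you should order: {pick_pizza()}")
```

Wrapping this in a web framework of your choice gives you the "slightly bigger" PoC with a real topic behind it.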

https://redd.it/1ivtlak
@r_devops
Icosic AI: Perplexity For Your Company’s Server Logs

Hello!

I'm Zuri, founder of Icosic AI, a startup based in San Francisco - we are Perplexity for your server logs.

The problem:

- searching through and filtering your logs using keywords is tedious at best

- semantic search is a step up, but still has no real intelligence regarding your query or your server logs

- engineers spend around 10 hours per week sifting through logs to investigate issues and uncover insights

The solution:

- Icosic AI is an intelligent search engine for all of your company's server logs

- We use LLMs to intelligently understand both your search query and all of your logs

- This gives you insights and answers that previously would take your engineers hours to uncover

- For example, a fintech company's engineer could ask "Why has there been a spike in transaction failures this morning?"

- Another example: "Tell me all instances where we got a high latency warning within 2 minutes of a transaction failure"
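Setting the LLM aside, that second query is essentially a time-window join over parsed log events. A plain-Python sketch of the underlying operation (timestamps and event names are made up for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical pre-parsed log events: (timestamp, event_type)
events = [
    (datetime(2025, 2, 24, 9, 0, 5), "transaction_failure"),
    (datetime(2025, 2, 24, 9, 1, 30), "high_latency_warning"),   # 85s after a failure
    (datetime(2025, 2, 24, 9, 30, 0), "high_latency_warning"),   # no nearby failure
]

def warnings_near_failures(events, window=timedelta(minutes=2)):
    """Return high-latency warnings within `window` of any transaction failure."""
    failures = [t for t, kind in events if kind == "transaction_failure"]
    return [
        t for t, kind in events
        if kind == "high_latency_warning"
        and any(abs(t - f) <= window for f in failures)
    ]

hits = warnings_near_failures(events)  # only the 9:01:30 warning qualifies
```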

The time and cost savings:

- A typical example is a company with 100 engineers, where 20 of them each spend 10 hours a week looking through logs to investigate issues and uncover insights

- If they're paid $70/hour, that's $70 * 10 hours * 4 weeks * 20 engineers = ~ $56,000 / month searching through logs. Our search engine does ALL of that for you.

More:

- You can integrate with your existing observability platforms like Datadog and Splunk to use logs that you've indexed there

- You can also just use logs that you've got on a cloud server somewhere at a specified path, for example /var/log/example.log

- You can use unstructured or structured logs, or both!

If you’re interested in finding out more, feel free to schedule a call with us from our landing page:

https://icosic.com

Also, you can start playing around with the product using our demo logs right away, no sign in required:

https://app.icosic.com

Feedback would be much appreciated!

What other integrations would you like to see? Let me know in the comments!

Thanks,
Zuri Obozuwa

https://redd.it/1ivx0db
@r_devops
Pipeline for dev containers to ECS?

Hey all! Just kind of thinking out loud here.

So I have pipelines etc. in place that handle deployments to ECS. But these are tightly integrated with other services, and I handle the deployments.

If I wanted to create a portal & pipeline where devs could enter the resource reqs and specify their repo/branch for a container image that's built and then deployed to a sandbox ECS env with endpoints for common services and flexible network constraints, are there any good resources to reference for this?

I feel like I'm missing features and use cases I haven't thought of that would be really cool here for improving the dev experience and giving devs some more autonomy in dev deployments. So if you have any ideas, or similar setups and how you use them, I'd love to hear about it!
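The core of such a portal is turning the dev-entered resource reqs plus repo/branch into an ECS task definition. A hedged sketch of just that step (account ID, role ARN, and registry URL are placeholders; the actual registration would go through boto3's `register_task_definition`):

```python
# Sketch: turn a dev's portal input into an ECS task definition dict.
# Field names mirror the ECS RegisterTaskDefinition API; the sandbox
# defaults below are hypothetical placeholders.

SANDBOX_EXECUTION_ROLE = "arn:aws:iam::123456789012:role/sandbox-exec"  # placeholder
SANDBOX_REGISTRY = "123456789012.dkr.ecr.eu-west-1.amazonaws.com"       # placeholder

def build_task_definition(service: str, image_tag: str, cpu: int, memory: int) -> dict:
    return {
        "family": f"sandbox-{service}",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": str(cpu),        # ECS expects CPU units as strings, e.g. "256"
        "memory": str(memory),  # MiB, as a string
        "executionRoleArn": SANDBOX_EXECUTION_ROLE,
        "containerDefinitions": [{
            "name": service,
            "image": f"{SANDBOX_REGISTRY}/{service}:{image_tag}",
            "essential": True,
        }],
    }

# The portal's pipeline would then do roughly:
#   boto3.client("ecs").register_task_definition(**task_def)
task_def = build_task_definition("checkout", "feature-x", cpu=256, memory=512)
```

From there the pipeline builds the image from the dev's branch, pushes it, registers the task definition, and updates a sandbox service.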

Cheers.

https://redd.it/1ivxly0
@r_devops
What are your biggest cloud infrastructure pain points?

Researching current cloud infrastructure setups and preferences across different teams. Interested in understanding:

• Which providers/tools teams are using
• How teams are handling multi-cloud
• Infrastructure costs and spending patterns
• Team size vs infrastructure complexity
• Deployment preferences and patterns

Quick 3-minute survey. Will share interesting trends and insights back with the community.

https://docs.google.com/forms/d/e/1FAIpQLSfadPrJIYpMpH8ETJKfITGc5sd4M3E-E6tnct6hC3a9lJ0DJQ/viewform



https://redd.it/1iw0sd2
@r_devops
How can I transition my career path to DevOps?

I started as an embedded software developer in March 2022, working on automotive software, and was assigned to the microcontroller team. But most of my tasks revolved around software test automation scripting with Robot Framework. I felt a lack of involvement in production development, as about 65% of my tasks were writing, testing, and deploying automation scripts.

I've had the opportunity to assist my integration team since June 2024 as a temporary integrator (ending June 2025). Basically, I've been helping the team automate as many processes as possible to ease the software integration flow. I gained exposure to Linux, Yocto, and Docker along the way. There's a lot more to learn and pick up, for sure.

I have 2 questions:

1. What should I learn and pick up to be a DevOps engineer?

2. Can I apply for DevOps roles elsewhere with my current experience and motivation to learn more?

https://redd.it/1iw0ggv
@r_devops
Question: ArgoCD for Dynamic Apps?

Hi,

I wanted to get some thoughts on an approach I'm considering. Say I have web apps with Helm charts for K8s deployment, and I want users to instantiate custom versions of these apps with their own configuration, e.g. branding, title, etc.

Does it make sense to store user configs in repos and then have ArgoCD sync that with the web app Helm charts via values.yaml? Whenever users change their custom configs, ArgoCD updates their deployments.
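That pattern (per-user configs in Git, synced by ArgoCD) is commonly implemented with an ApplicationSet and a Git directory generator, so each user directory becomes its own Application. A sketch, assuming a hypothetical repo that holds both the shared chart and one `users/<name>/values.yaml` per user:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: user-webapps
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/user-configs.git  # hypothetical repo
        revision: HEAD
        directories:
          - path: users/*          # one directory per user
  template:
    metadata:
      name: 'webapp-{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example/user-configs.git
        targetRevision: HEAD
        path: chart                 # the shared web app Helm chart
        helm:
          valueFiles:
            - '../{{path}}/values.yaml'   # the user's branding/title overrides
      destination:
        server: https://kubernetes.default.svc
        namespace: 'webapp-{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
```

With this, adding or editing a user's config in Git creates or updates their deployment automatically, exactly the flow you describe.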

Are there other approaches/tools I should consider?

Thanks!



https://redd.it/1iw4e63
@r_devops
What should I do as a DevOps Intern, prepare for MNC's aptitude exams or for Certifications?

I am a final-year engineering student from a not-so-good college. Currently, I’m doing an internship at an AI startup as a DevOps/SRE intern. I’m happy with the job and the company, but I want to explore and learn more, preferably outside my state.

I have completed the AZ-104 Azure Associate certification and am preparing for the CKA and other DevOps-related certifications. However, as a fresher, I’m confused about whether I should focus on certifications or prepare for aptitude and coding tests for big MNCs like TCS, Infosys, Wipro, and IBM.

I personally prefer working at startups because I've seen that they offer great learning and growth opportunities. But all my friends and brothers are in big MNCs, and they suggest aiming for MNCs for job security. Please guide me with your experiences: what should I do?

https://redd.it/1iw4wjo
@r_devops
Production-Ready Coding: Best Practices for Developers

Hey all!
I wanted to share a quick list of my "rules of thumb" for production-ready coding.

Basically, when you want to move from a hobby pet project to a real production application - what is needed?

For me, the list is simple:

0. Code must be compilable :)

1. Code must be readable for others. E.g. no one-letter variables, good comments where appropriate, no long methods.
2. Code must be tested. Not necessarily 100% coverage, but "good" coverage, with different types of tests available: unit, integration, end-to-end.
3. Code must be documented. At least in the README.md. Better if you have additional documentation describing the architecture, design decisions, and contribution process.
4. Code must be monitored. At minimum, errors should be logged to standard output, and you should be able to track infrastructure metrics somehow.
5. Code must be somewhat secure. User input should be sanitized, and something like the OWASP Top 10 should be checked.
6. Code should be deployable via CI/CD tool.
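For point 4, "logs to standard output" can be as simple as emitting one JSON object per line, which any log collector can parse. A minimal stdlib-only sketch:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (easy to ship and parse)."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # containers log to stdout
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment failed")  # prints {"level": "ERROR", ...} on one line
```

Structured one-line records also make point 2's integration tests easier, since log output can be asserted on directly.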

What else would you add to the list?

And just in case, as a bit of self-promotion: I made a video describing these topics in more detail - https://youtu.be/cdzrS-w_bJo It would be great if you could like & subscribe :)

https://redd.it/1iwcdur
@r_devops
Are DevOps Under Job Threat?

Hello everyone.
I'm currently tagged as a DevOps Engineer with the following experience:
Azure Web Apps and VMs, Azure DevOps.
I have 4.2 years of experience since I started my career in the IT industry.
I don't have any experience with K8s, Docker, monitoring, Jenkins, or any other tools.

I want to know how afraid I should be of this AI impact.
Should I change my domain from DevOps to data engineering or something else?
Which DevOps area is AI-proof (so that our jobs won't be affected much)?

I'm really afraid and in panic mode right now, as people are getting laid off and these CEOs and big companies come up with some new claim every week that AI will impact our jobs.
Please guys, HELP ME!!

https://redd.it/1iwd4yz
@r_devops
Pull request testing on Kubernetes: vCluster for isolation and costs control

This week’s post is the third and final in my series about running tests on Kubernetes for each pull request. In the first post, I described the app and how to test locally using Testcontainers and in a GitHub workflow. The second post focused on setting up the target environment and running end-to-end tests on Kubernetes.

I concluded the latter by mentioning a significant quandary. Creating a dedicated cluster for each workflow significantly impacts the time it takes to run. On GKE, it took between 5 and 7 minutes to spin up a new cluster. If you create a GKE instance upstream, you face two issues:

* Since the instance is always up, it raises costs. While they are reasonable, they may become a deciding factor if you are already struggling. In any case, we can leverage the built-in Cloud autoscaler. Also, note that the costs mainly come from the workloads; the control plane costs are marginal.
* Worse, some changes affect the whole cluster, e.g., CRD version changes. CRDs are cluster-wide resources. In this case, we need a dedicated cluster to avoid incompatible changes. From an engineering point of view, it requires identifying which PRs can run on a shared cluster and which need a dedicated one. Such complexity hinders delivery speed.

In this post, I'll show how to get the best of both worlds with vCluster: a single cluster, with each PR tested in complete isolation from the others.

Read more...

https://redd.it/1iwhrz2
@r_devops
Help - Best way to interview SRE/DevOps

Looking for advice from anyone with experience as a hiring manager or interviewer for an SRE team.

I usually prefer candidates with some HackerRank coding experience, strong Linux administration, Kubernetes expertise, and networking fundamentals. If anyone can share their best practices for evaluating these skills, that would be great.

I need to validate candidates for the following skills:

* Linux Administration (hands-on with Ubuntu)
* Networking Concepts (L2/L3, OSI layers)
* Kubernetes Administration (on-prem)
* Programming - Python/Go (developer-level preferred, but not mandatory)
* Observability Stack (Prometheus, Grafana, Loki, VictoriaMetrics)
* AWS Proficiency
* Ansible (comfortable using it for automation)

The ideal candidate would have 5 years of experience. Again, I am only looking for feedback and tips on the interview process; feel free to share your views.

https://redd.it/1iwjd7s
@r_devops
Looking for a DevSecOps Role - Remote


Hey folks! I'm looking for a DevSecOps role where I can leverage my skills in automation, security, CI/CD, and cloud infrastructure. Experienced in AWS, Kubernetes, Docker, Terraform, and security best practices. Also, have a strong background in SecOps, DevOps, and FinOps.

Open to remote opportunities! Feel free to connect or drop any leads. Cheers!

https://redd.it/1iwm5kh
@r_devops
Simplifying Infrastructure-as-Code with Our SaaS Solution

Imagine deploying powerful cloud infrastructure, like Google Cloud Storage or a full virtual machine, without ever needing to write a single line of code or wrestle with complex tools. Our Software-as-a-Service (SaaS) application takes the headache out of Infrastructure-as-Code (IaC) and puts it into the hands of anyone, regardless of experience. Whether you're a small business owner, a startup founder, or a developer looking to save time, we make Google Cloud Platform (GCP) deployments effortless, secure, and scalable.

What We Offer

Our SaaS is built for simplicity and power:

* No Expertise Needed: You don't need to know Terraform, IaC, or even how GCP fully works. Just connect your GCP project, pick a service, like Google Cloud Storage, and hit "Deploy." We handle the rest.
* Ready-Made Building Blocks: We maintain a library of pre-built Terraform modules (think of them as blueprints for cloud services) in our own GitHub repository. These are battle-tested and ready to go.
* Personalized Deployment: Your infrastructure lives in your GCP project, not ours. We use your authorized credentials to set everything up exactly where you want it.
* Future-Proof Growth: Starting with services like Google Cloud Storage, we're designed to easily add more GCP offerings as your needs evolve.

# How It Works: The Big Picture

Here’s what happens behind the scenes when you use our SaaS:

1. You Connect: Through a clean, intuitive interface, you link your GCP project to our app.
2. You Choose: Pick a service from our list, say, a secure storage bucket for your files.
3. We Deploy: Our system fetches the right Terraform module from our GitHub repo, customizes it for your project, and deploys it to GCP using your secure credentials. Done!

You get enterprise-grade infrastructure without the complexity.
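For context, one of the "blueprints" described above can be surprisingly small. Here is an illustrative Terraform sketch of a storage-bucket module (variable names are made up for this example, not our actual module):

```hcl
variable "name" {
  type = string
}

variable "location" {
  type    = string
  default = "EU"
}

# The module's single resource: a GCS bucket in the customer's project.
resource "google_storage_bucket" "this" {
  name                        = var.name
  location                    = var.location
  uniform_bucket_level_access = true
}

output "url" {
  value = google_storage_bucket.this.url
}
```

When you hit "Deploy," our system fills in the variables from your selections and applies a module like this against your project.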

# The Tech That Powers It

* Frontend: It's where you log in, connect your GCP account, and make selections.
* Backend: It securely handles your authentication, fetches the Terraform modules, and executes the deployment process.
* Terraform Magic: We store our predefined Terraform modules in a GitHub repository (saas-infra-modules). These are reusable scripts that define how services like Google Cloud Storage should be built in GCP. When you deploy, we tailor and apply them to your project.
* Scalability: Our architecture is modular. Adding support for new GCP services, like Compute Engine or BigQuery, is as simple as dropping new Terraform modules into our repo.

# Authentication: How We Keep It Secure and Simple

Let’s talk about how we connect to your GCP project—because security and trust are non-negotiable. We use a standard called OAuth 2.0, the same technology you’ve likely used to log into apps with your Google account. Here’s how it works and why it’s safe:

1. Your Permission: When you connect your GCP project, our app redirects you to a Google login page. You sign in with your Google account—the one tied to your GCP project—and grant us permission to manage resources on your behalf. This happens in a secure, Google-controlled environment, not ours.
2. Limited Access: Google generates an OAuth token (a kind of digital key) that we use to act only within your project and only for the tasks you approve—like deploying a storage bucket. This token has an expiration date and can be revoked by you at any time through your Google account settings.
3. No Stored Secrets: We don’t ask for your GCP passwords or private keys. The OAuth token is temporary and encrypted, ensuring your credentials stay yours alone.
4. Our Side: To fetch our Terraform modules from GitHub, we use a Personal Access Token (PAT)—but that’s our key, not yours. It’s locked down to read-only access for our repo, keeping everything compartmentalized.

Think of it like giving a trusted contractor a keycard to renovate one specific room in your house. They can't wander into other rooms, and you can take the keycard back whenever you want. That's how we authenticate and protect your project.

# Why This Matters to You

* Time Savings: Deploying infrastructure that might take hours or days (and a hired expert) now takes minutes.
* Cost Efficiency: No need to hire IaC specialists or spend weeks learning Terraform. Our SaaS is your shortcut.
* Control: Your infrastructure lives in your GCP account, under your billing and ownership, not some third-party sandbox.
* Security: With Google's OAuth and our transparent process, you're protected at every step.

# The Vision

Today, it’s Google Cloud Storage. Tomorrow, it’s Compute Engine, Kubernetes, or whatever GCP service you need. Our SaaS grows with you, simplifying the cloud so you can focus on your business—not the tech.

Ready to deploy your first service? Let’s connect your GCP project and get started—no coding required.


If you found this service helpful, how much would you be willing to pay to use it?

If you’re interested in this service, please reach out to join our waitlist! When we launch, you’ll get one month of free usage.

https://redd.it/1iwne3x
@r_devops
Best practices on storing user-uploaded files in containerized environment

I’m working on a job board and have recently containerized our Next.js/Node.js application using Docker (deployed on AWS ECS). One big technical hurdle is handling user-uploaded files (resumes) in a containerized setup.

Currently I'm writing these files to the container's filesystem, which is definitely not ideal! What's a clean and simple approach to file storage that aligns with DevOps best practices? Specifically:

1. Persistent storage options: Which solutions work best for ephemeral containers? An NFS volume, EFS, or a cloud storage bucket (e.g., S3)?
2. Deployment pipeline integration: How do you usually handle storing or moving uploads during blue/green or rolling deployments?
3. Security considerations: Any recommended steps to ensure data integrity and secure transfer? (e.g., encryption in transit, SSE for S3, etc.)
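Not a full answer, but on point 3: whichever store you pick, don't trust the user-supplied filename when building the storage key. A sketch of namespacing uploads under per-user keys (the bucket name and key scheme are hypothetical, and the boto3 call is shown only as a comment since it needs AWS credentials):

```python
import re
import uuid

def resume_key(user_id: str, filename: str) -> str:
    """Build a collision-free, traversal-safe S3 key for an uploaded resume."""
    # Drop any path components, then keep only a safe subset of characters.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", filename.rsplit("/", 1)[-1]) or "resume"
    return f"resumes/{user_id}/{uuid.uuid4().hex}-{safe}"

# Upload side (requires boto3 + credentials), roughly:
#   boto3.client("s3").upload_fileobj(
#       file_obj, "my-jobboard-uploads", resume_key(user_id, filename),
#       ExtraArgs={"ServerSideEncryption": "AES256"},  # SSE at rest
#   )
key = resume_key("42", "../../etc/passwd")  # path components are stripped
```

Keeping the object key server-generated like this also sidesteps deployment concerns: the bucket outlives every container, so blue/green rollouts don't need to move any files.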

Ty!

https://redd.it/1iwohig
@r_devops
Keycloak on EKS Failing to Mount AWS Secrets Manager Credentials

Hey folks,
I’m running Keycloak on an EKS (v1.27) cluster and having trouble mounting secrets from AWS Secrets Manager using the Secrets Store CSI Driver (v1.3.4). Both the Keycloak and PostgreSQL pods are stuck in a `CreateContainerConfigError` state with errors like:

Error: secret "keycloak-secrets" not found
csi-secrets-store-controller: file matching objectName [secret] not found in pod


Below are the relevant details of my setup:

# Environment

* **EKS version**: 1.27
* **Secrets Store CSI Driver**: 1.3.4
* **AWS Secrets Manager**: Verified the secrets exist
* **IAM Policies**: Node role and/or IRSA with `SecretsManagerReadWrite` policy

# SecretProviderClass

Here’s an excerpt (Terraform format) showing how I’m configuring my `SecretProviderClass`:

resource "kubernetes_manifest" "keycloak_secret_provider" {
  manifest = {
    apiVersion = "secrets-store.csi.x-k8s.io/v1"
    kind       = "SecretProviderClass"
    metadata = {
      name      = "keycloak-secret-provider"
      namespace = "my-namespace"
    }
    spec = {
      provider = "aws"
      secretObjects = [{
        secretName = "keycloak-secrets"
        type       = "Opaque"
        data = [{
          key        = "postgres-password"
          objectName = "nonprod-secret-postgres_keycloak_auth"
        }]
      }]
    }
  }
}


# Pod/Deployment Snippet

Here’s a condensed example of how my Keycloak Deployment references the `SecretProviderClass`:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
  namespace: my-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: keycloak
  template:
    metadata:
      labels:
        app: keycloak
    spec:
      securityContext:
        fsGroup: 1000
      serviceAccountName: keycloak-sa  # (has IRSA or node role with Secrets Manager perms)
      containers:
        - name: keycloak
          image: quay.io/keycloak/keycloak:21.1
          volumeMounts:
            - name: secrets-store
              mountPath: /mnt/secrets
              readOnly: true
          # other container configs ...
      volumes:
        - name: secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: keycloak-secret-provider


# What’s Happening

1. Pods fail to start with `CreateContainerConfigError`.
2. Logs/Events complain that `secret "keycloak-secrets" not found`.
3. `csi-secrets-store-controller` logs say `file matching objectName [secret] not found in pod`.

# Troubleshooting So Far

* **AWS Secrets Manager**: Confirmed the secret `nonprod-secret-postgres_keycloak_auth1` exists.
* **IAM Policies**: Verified the node role (or service account with IRSA) has `secretsmanager:GetSecretValue` and other necessary permissions.
* **Terraform**: No drift reported; everything else is applying cleanly.
* **Namespace Check**: Both the `SecretProviderClass` and Keycloak pods are in the same namespace (`my-namespace`).
* **Multiple Pod Restarts**: No change in error status.

# Potential Issues / Questions

1. **Permission Gaps?** Is there a hidden or additional permission needed for the node (or service account) beyond `SecretsManagerReadWrite`?
2. **Secret Sync vs. Ephemeral Mount?** Am I accidentally referencing a Kubernetes Secret (`keycloak-secrets`) that isn’t being created because I only set up ephemeral volume mounting?
* If I need a native K8s Secret, do I have to enable `syncSecret.enabled: true` in the SecretProviderClass?
3. **Name Mismatch?** Could there be a subtle naming or label mismatch in my code—`keycloak-secret-provider` vs. `keycloak_secrets` or a missing [`metadata.name`](https://metadata.name) or `namespace`?
4. **Volume Permissions?** Does `fsGroup: 1000` cause any issues with how the CSI driver writes secret files?
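On question 2, one thing worth checking: `secretObjects` only produces a native Kubernetes Secret when the driver is installed with secret syncing enabled, and even then the Secret is created only while a pod actually mounts the CSI volume. If any container references `keycloak-secrets` via `env`/`envFrom`, that could explain the `CreateContainerConfigError`. With the upstream Helm chart, the relevant values are roughly (a hedged sketch; adjust to your install method):

```yaml
# values for the secrets-store-csi-driver Helm chart
syncSecret:
  enabled: true            # deploy the controller that mirrors mounted objects into K8s Secrets
enableSecretRotation: true # optional: keep the synced Secret updated when the AWS secret rotates
```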

# Additional Info

* **Logs**: I’ve checked the CSI driver logs in `kube-system` (or wherever it’s installed). They only say “file not found” which hints it can’t read or place the files in `/mnt/secrets`.
* **Secrets Manager Tests**: I can successfully `aws secretsmanager get-secret-value` from my workstation using the same IAM role to confirm the secret is accessible.
* **Terraform**: My `kubernetes_manifest` might need more explicit fields. But so far, I haven’t spotted an obvious misconfiguration.

# Key Things I’d Love Feedback On

* Has anyone run into this “file matching objectName not found” error with Secrets Store CSI on EKS?
* Is there a detail or annotation required to mount AWS secrets as ephemeral files under `/mnt/secrets`?
* Am I missing a step in the process of syncing the AWS Secret to a native K8s Secret if that’s what my app is expecting?

Any insights, especially from folks who have Keycloak + AWS Secrets Manager working in EKS, would be hugely appreciated. Thank you! I feel like I am between a rock and a hard place and have been going in circles with this.

https://redd.it/1iwnxmj
@r_devops
US cloud providers and Europe

Hi !
So I live in Europe, and we all know about current events in the US. A lot of companies are saying that they should leave US cloud providers.
Many of them are talking about the GDPR (RGPD in French, the EU's personal data protection regulation) and about the fact that the US can access data stored on US providers' servers as it wants (even when hosted in the EU).
What do you think about this? Does Europe need to worry?

https://redd.it/1iwqs76
@r_devops
Looking for a Devops/Data Engineer Job

Individual with 1 year 9 months of industry experience at an MNC. Looking for a job to learn and grow more.



https://redd.it/1iwvf6r
@r_devops
Best practices for managing schema updates during deployments (both rollout and rollback)

Hello there,

While walking my DevOps learning path, I started wondering about industry best practices for the following use case:

1. app container gets updated from v1 to v2
2. database schema needs to be upgraded (new table, new columns)
3. (I suppose) the app has all the migration SQL commands to do that on startup, once it detects that the schema needs to be changed
4. app is online, great
5. OUCH! Something went wrong. Let's roll back... two scenarios:
   1. data has been added to the DB in the meantime; we need to save that data and merge it later
   2. let's ignore new data, just revert back ASAP
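For step 3, a common lightweight pattern is a startup migration runner that records applied versions in the database itself. Here is a sketch using sqlite3 purely for illustration (table names and migration contents are made up). For rollback, the usual advice is to keep schema changes backward compatible ("expand/contract"), so the v1 app can still run against the v2 schema and scenario 2 becomes a plain redeploy without touching data:

```python
import sqlite3

# Ordered migrations; each runs at most once, tracked in the DB itself.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),  # additive => v1 still works
]

def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations on startup; return the resulting schema version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
            current = version
    conn.commit()
    return current

conn = sqlite3.connect(":memory:")
version = migrate(conn)  # applies both migrations, returns 2
```

Destructive steps (dropping the old column/table) are then deferred to a later release, once the rollback window has passed.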

What do you think about those two scenarios? Should the app be responsible for everything, or is it a separate process that can't be automated?

Thanks for any explanation.

https://redd.it/1iwy2su
@r_devops