Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Organizing & minimizing cloud costs in AWS

We're running a bunch of workloads on AWS across different accounts. Every now and then (usually when we have a big spike in expenses), we find ourselves trying to figure out where our main expenses are coming from, which workloads are currently running and wasting money, and whether we have redundant workloads that we should get rid of.

In general, we constantly try to tag workloads and to educate the team to add the relevant tags to anything they start (often developers launching EC2 machines, snapshots, SageMaker pipelines, S3 buckets, etc.).

Needless to say, sometimes people don't add tags at all, or don't add the appropriate ones. Sometimes people leave their expensive instances running idle over the weekend, etc.

How do you guys handle monitoring your workloads (which asset belongs to which project), tracking expenses, reducing redundant workloads to a minimum, and generally keeping the environment clean so that a lot of money is not spent unnecessarily?
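
The tagging problem described above can be checked mechanically once resource metadata is exported. Below is a minimal sketch, assuming a hypothetical tag policy (`project` and `owner`) and a made-up inventory format; neither comes from the post, and a real setup would feed in data from whatever inventory export is available.

```python
# Hypothetical tag policy; the post does not say which tags the team requires.
REQUIRED_TAGS = {"project", "owner"}


def missing_tags(resource_tags: dict, required=REQUIRED_TAGS) -> set:
    """Return the required tag keys that are absent or empty on one resource."""
    # Tag keys are compared case-insensitively; empty values count as missing.
    present = {key.lower() for key, value in resource_tags.items() if value.strip()}
    return {key for key in required if key.lower() not in present}


def audit(resources: list) -> dict:
    """Map resource id -> missing tag keys, skipping fully tagged resources."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource.get("tags", {}))
        if gaps:
            report[resource["id"]] = gaps
    return report


# Made-up inventory entries, shaped like a simplified export of resource metadata.
inventory = [
    {"id": "i-0aaa", "tags": {"Project": "search", "Owner": "dana"}},
    {"id": "i-0bbb", "tags": {"Project": "search"}},
    {"id": "snap-0ccc", "tags": {}},
]
report = audit(inventory)
```

Running something like this on a schedule would at least turn "someone forgot a tag" into a list that can be followed up on, rather than something discovered during a cost spike.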

https://redd.it/1eqf1h3
@r_devops
New to DevOps - How do I find where our Grafana instance is installed in our EKS cluster?

Good day folks. I was tasked with troubleshooting a Grafana/Loki issue but I don't know where to start. I looked at our console and tried to verify that the Loki data connection was good to go, but it isn't: it can't call the resource. I was told that it used to work and then stopped working a few weeks ago. I didn't configure Grafana or Loki myself, so I don't know the details.

At this point I am just trying to find where the Grafana/Loki configuration is located. The lead for our particular section of the project is out sick. And even when he gets back, I hate asking him stuff like this because I get the notion that either 1. he feels like I should know it already, or 2. he just hates being bothered. He never voiced this, but his tone isn't really inviting lol.

I have been a systems admin for quite a while, and just this year I got the opportunity to get deep into DevOps. So, sorry if my responses aren't as educated as one would expect lol. Our environment seems very intricate, and it's not only me: a few other new hires with over 10-15 years in IT are also saying that the way these guys are getting us accustomed to this environment isn't optimal lol.

Thanks in advance.

https://redd.it/1eqgonz
@r_devops
Where to store a Python script run as part of a GitHub Actions workflow?

I have a GitHub Actions workflow which orchestrates Terraform resources across a few different platforms. One step of the process runs a Python script which queries one of our platforms for a few key pieces of info and appends them to tfvars. Currently that script lives in the module root folder. This is part of a template which is cloned to create multiple services, so each repo has its own copy of the script. That's probably a bad practice, since we'll have to update each individual script in each repo if we ever make changes. What is the best way to make this script available to the workflow as a single source of truth?
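
The "query, then append to tfvars" step described above might look roughly like the sketch below. The function names and the sample values are hypothetical, since the post does not show the real script. Keeping the logic in small importable functions like these is what would let it live in one shared place instead of being copied into every repo.

```python
import json


def render_tfvars(values: dict) -> str:
    """Render simple values as `name = value` lines in tfvars syntax.

    json.dumps is used for quoting. This is an assumption that holds for plain
    strings, numbers, booleans and flat lists, not for every possible value.
    """
    return "".join(f"{name} = {json.dumps(value)}\n" for name, value in values.items())


def append_tfvars(path: str, values: dict) -> None:
    """Append the rendered lines to a tfvars file (created if it does not exist)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(render_tfvars(values))


# Hypothetical values standing in for whatever the platform query returns.
queried = {"service_id": "svc-123", "replicas": 3, "enable_alerts": True}
rendered = render_tfvars(queried)
```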

https://redd.it/1eqgpp8
@r_devops
Sharing a Kubernetes + Azure Key Vault Integration Guide - Feedback Welcome!

Hey r/DevOps community,
I've been working on integrating Kubernetes with Azure Key Vault using OIDC, and I thought I'd share what I've learned in case it's helpful to others.
I've put together a detailed video guide (about 2 hours long) covering:
- Setting up a K3d cluster
- Establishing OIDC trust between Kubernetes and Azure Key Vault
- Implementing the External Secrets Operator
- Practical secret management in a production-like environment
Here's the link: https://youtu.be/JFJJWB7neIg?si=auHt3HF0wqZT5ZC7
I'm not here to promote myself, just to share knowledge. I've learned so much from this community, and I hope this can give something back.
If you do check it out, I'd be incredibly grateful for any feedback, corrections, or suggestions for improvement. There's always more to learn, and I'm sure many of you have tackled similar challenges in different (probably better) ways.
Some questions I'm particularly interested in:
- Have you faced any specific challenges with Kubernetes-Azure integration that aren't covered here?
- Are there any best practices or security considerations you think are crucial for this kind of setup?
- How do you handle secret management in your organizations?
Whether you watch the video or not, I'd love to hear your thoughts and experiences on this topic. Thanks for being such a great community for learning and sharing!

https://redd.it/1eqj2gl
@r_devops
EKS for dev teams

I got a task to build an EKS cluster for software developers. While the EKS setup is clear, I have a question: what would developers prefer for deploying their stuff (including observability, logs, etc.)? I am looking at tools like ArgoCD, but I've also heard some not-so-favorable comments about it. So I'd prefer pure pipelines, but in my opinion it's still nice to be able to see “something” in the cluster.

https://redd.it/1eql1s9
@r_devops
How to make my app reachable from outside through a URL

Hello, I applied to a free bootcamp that offers job opportunities, and they sent me a use case. They want me to create an API service with a GET health endpoint to check the app's health, then create a Dockerfile and push my app to Docker Hub. After that, I should set up k8s with a rule that restarts the application if the health endpoint fails. I did everything up to that point.
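
For readers unfamiliar with the setup described above, here is a minimal sketch of a health endpoint using only the Python standard library. The `/health` path, the JSON body, and port 8080 are assumptions for illustration, not details from the post. In Kubernetes, a liveness probe pointed at such a path is what triggers the container restart when the check fails.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json


def health_response(path: str) -> tuple:
    """Return (status_code, body_bytes) for a request path."""
    if path == "/health":
        return 200, json.dumps({"status": "ok"}).encode()
    return 404, json.dumps({"error": "not found"}).encode()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Delegate to the pure function so the logic stays easy to test.
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Block forever serving requests; meant to be called from the container entrypoint."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```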

But they also want the app to be reachable through a URL. I can use a native load balancer or a tool like nginx.

I don't know how to do that since I don't have a DNS name on hand.

I would be glad to hear your advice.
Thank you

https://redd.it/1eqniti
@r_devops
Using a work-provided sandbox for learning? At odds here..

I have access to work-provided Azure and AWS sandboxes. As a total newb and junior, I need to upskill. I can see where I am lacking, and it involves AWS, Azure, and cloud-based skills, which my employer knew.

I want to start using it by deploying sample websites to learn PKI, certificate installs, buying domains, etc. My only concerns with that are: 1. Buying things (domains, etc.) is not cheap. 2. I was told it's for learning purposes and to kill off anything I won't be using (I was also planning to self-learn Terraform this way, to help destroy any infrastructure I create).

Is there anything inherently bad about learning how to do this? For one, I would not be establishing a personal website or the like, but I would like to learn how to deploy one. Since this account is being paid for, would I be better off mocking something we have in our dev or test environment to learn from in my own sandbox?

I've never really had one, so I'm seeking some input on that.

https://redd.it/1eqpfxy
@r_devops
Created a Terraform config for faster Docker builds using a remote BuildKit instance

Hey folks, I know most of you use Docker in your CI/CD pipelines. Slow Docker builds are so annoying and frustrating—we’ve all been there!

I created this open-source repo, https://github.com/useblacksmith/remote-buildkit-terraform, which contains a Terraform config to quickly spin up and configure a remote BuildKit instance in AWS that caches Docker layers and substantially speeds up Docker builds.

It is not perfect and wouldn’t work for large engineering teams, but it could really help many folks here.

Feel free to use it and let me know what you think.

https://redd.it/1eqr954
@r_devops
Do we have a good use case for k8s Jobs?

Hello,

We are looking at optimising our Kubernetes workloads. The clusters are hosted on AWS EKS.

For reference, here is a small overview of how a usual Java/Python service works in our cluster:

We are using AWS Step Functions to create a message in SQS, and our pods are constantly checking their respective queues. If there is a new message, the pod performs the task. We scale the pods based on queue size to be able to handle higher traffic. For the HPA we are using the Zalando adapter for the metrics server (GitHub).

So far this works quite well. However, most of our services are not triggered often, which means we have a lot of pods just running without doing anything.

To make better use of our resources, we thought about migrating some of these services from pods to jobs. When a new message is sent to a queue, it would trigger a Kubernetes job (it looks like KEDA could be used for this). The service would perform its task and then the job would terminate.
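
The scaling idea above, running workers only while there is a backlog, comes down to a small calculation. The sketch below is illustrative only: the function name and the `messages_per_worker` and `max_workers` parameters are assumptions, and it is not the algorithm that KEDA or the HPA actually implements.

```python
import math


def desired_workers(queue_length: int, messages_per_worker: int, max_workers: int) -> int:
    """Number of workers (pods or jobs) to run for the current queue backlog.

    Returns 0 for an empty queue, which is the scale-to-zero behaviour that
    always-running pods lack, and never exceeds max_workers.
    """
    if messages_per_worker <= 0:
        raise ValueError("messages_per_worker must be positive")
    if queue_length <= 0:
        return 0
    return min(max_workers, math.ceil(queue_length / messages_per_worker))
```

For example, with 12 queued messages and 5 messages per worker this yields 3 workers, and an empty queue yields none.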

Would this be a good use case for Kubernetes jobs, or would you recommend looking at other approaches?

Thanks!

https://redd.it/1eqnxup
@r_devops
Should we use CI/CD on production?

Yesterday, my colleague told me that he didn’t think implementing CI/CD on the production environment was a good idea, since it could accidentally break something and get out of control. He suggested that we deploy to production manually. What do you guys think about it? Please let me know.

https://redd.it/1equmsf
@r_devops
Immutable VM image bakery companies?

What companies create hardened immutable VM images? For container/Docker images, Chainguard seems to be the front runner. Do any companies focus on VM images?

https://redd.it/1eqv2sm
@r_devops
Attempting a Website Builder

Hey everyone. I'm attempting to build a website builder (targeting low-traffic sites).

My plan was to initially deploy a single VM and run multiple containers on it (i.e. backend, frontend, reverse proxy, and certbot for the app/builder) and have a service like Vercel/Netlify handle all the domains and the deployment of users' websites.

But then I had the bright idea: what if I have a go at it myself and learn more DevOps along the way? What do I need to know to build the DevOps side of a website builder?

At first I thought I should try running everything on a single VM to reduce costs initially, scale vertically, and worry about scaling horizontally across multiple VMs later. (I know it's a single point of failure.)

Am I crazy for even thinking a website builder can operate without Kubernetes?

Currently I have a CI/CD pipeline, with infrastructure managed by Terraform, and Ansible configuring my VM and pulling and running my Docker images.

Any direction or thoughts would help. I am fairly new to DevOps, so sorry if my explanations aren't clear.

Many thanks.

https://redd.it/1eqyhw6
@r_devops
See the cost of your Terraform in IntelliJ IDEs, as you develop it

Hi, my name is Owen and I recently started working at Infracost (YC W21 batch) (https://infracost.io). Infracost shows engineers how much their code changes will cost on the cloud before it gets deployed. For example, when an engineer changes a cloud resource (like an AWS virtual machine), Infracost posts a comment in CI/CD telling them "This change is going to increase your costs by 25% next month from $500/m to $625/m".
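
The arithmetic behind the example comment above is a plain percentage change. The sketch below shows it with hypothetical helper names. It illustrates the calculation only and is not how Infracost is implemented.

```python
def percent_change(old_monthly: float, new_monthly: float) -> float:
    """Relative change in monthly cost, in percent."""
    if old_monthly == 0:
        raise ValueError("cannot compute a percentage change from a zero baseline")
    return (new_monthly - old_monthly) / old_monthly * 100


def describe(old_monthly: float, new_monthly: float) -> str:
    """Format the change in the style of the comment quoted in the post."""
    change = percent_change(old_monthly, new_monthly)
    direction = "increase" if change >= 0 else "decrease"
    return (
        f"This change is going to {direction} your costs by {abs(change):.0f}% "
        f"next month from ${old_monthly:.0f}/m to ${new_monthly:.0f}/m"
    )


# The figures from the post: (625 - 500) / 500 * 100 = 25.
message = describe(500, 625)
```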

Previously, I was one of the founders of tfsec, the code security scanner; I quickly realised that identifying issues in your code (especially infrastructure code, i.e. Terraform) as soon as possible was the best defence. A lot of the principles of code scanning for security misconfigurations translate well to identifying cost impact. Many times, people are surprised by how cloud resources are priced and how expensive they can be. It is also really unfair that engineers are never given a ‘checkout screen’ when buying infrastructure, and then are blamed for breaking cloud budgets.

I believe engineers should have access to key information about cloud costs at the time of writing the code. So, I spent some time and built an Infracost plugin for the IntelliJ family of IDEs (https://plugins.jetbrains.com/plugin/24761-infracost).

With this plugin installed, as you develop your Terraform code, you will get the cost impact of your current project, and quickly see where the expensive resources are hiding in your code (just hit save & it will recalculate). Two main use cases I’m thinking of:

- As you change resources, you can see the cost impact. For example, I increased the instance size from my Dev to my Prod environment to handle the prod-sized workloads, and I can see the increased costs.
- Comparing costs: I can copy + paste blocks of code and see the cost impact of using different configuration options, like removing multi-AZ options from test environments, etc. I can immediately see that I save a few thousand dollars per year that way.

You can still use Infracost in GitHub/GitLab to automate the cost analysis in CI/CD, and check for best practices, and the IDE tools will help you spot the issues sooner.

I’d love to get your feedback on this. I want to know if it is helpful, what other cool features we can add, and how it can be improved. Also if you spot any issues or bugs, let me know!

Here is how to install it: https://plugins.jetbrains.com/plugin/24761-infracost

I've done a demo video to get you started too - https://www.youtube.com/watch?v=kgfkdmUNzEo

https://redd.it/1er7966
@r_devops
Take control over GitHub repositories through leaked secrets in artifacts

New research shows how organizations tend to embed secrets in GitHub Actions workflow artifacts, mainly GitHub tokens. While the GITHUB_TOKEN is invalidated as soon as the job is complete, it's still possible to track the artifact upload and utilize the token to push code to the repository before the job is done.

The issue was found in highly popular open source projects owned by Google, Microsoft, AWS, Red Hat, Canonical (Ubuntu), OWASP, and others.

https://unit42.paloaltonetworks.com/github-repo-artifacts-leak-tokens/
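
One defensive measure against the leak described above is to scan files for token-shaped strings before they are uploaded as artifacts. The sketch below assumes that GitHub tokens start with one of the prefixes `ghp_`, `gho_`, `ghu_`, `ghs_` or `ghr_` followed by at least 36 alphanumeric characters. That pattern is an assumption to verify against current GitHub documentation, and the function names are hypothetical.

```python
import re

# Assumed token shape: a known prefix plus 36 or more alphanumeric characters.
TOKEN_PATTERN = re.compile(r"\b(?:ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9]{36,}\b")


def find_tokens(text: str) -> list:
    """Return every substring that looks like a GitHub token."""
    return TOKEN_PATTERN.findall(text)


def redact(text: str) -> str:
    """Replace token-shaped substrings so the text is safer to publish."""
    return TOKEN_PATTERN.sub("[REDACTED]", text)


# A fake token built for the example; it is not a real credential.
fake_token = "ghs_" + "a1B2" * 9
sample = f"GITHUB_TOKEN={fake_token}\nother=1"
found = find_tokens(sample)
cleaned = redact(sample)
```

A check like this reduces the chance of a token ending up in an artifact, but it only catches strings that match the assumed pattern.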

https://redd.it/1er8x0j
@r_devops
I am building a new CI tool. What things should I keep in mind?

If I were to build a new CI tool, what are some things I should do that would give me a competitive edge over others?

https://redd.it/1er9cwm
@r_devops
Need to run 4 web applications, each requiring only 0.25 CPU and 500 MB of RAM. What's the most economical way on AWS?

I'm looking into various options to run 4 web applications, each requiring only 0.25 CPU and 500 MB of RAM (or even less). Traffic is fairly low, less than 1k active users a month. Each application is merely an SPA bundled with a Node backend. These applications also update very frequently (once or twice a day), so each update needs to go from code to a running application automatically, swapping out the old version without downtime and without supervision.

Sure, I could set up an EKS cluster running solely on spot nodes, with multiple replicas of each application to ensure a spot termination doesn't cause downtime. But even that would cost me roughly $200 a month (guesstimate). Slap in ArgoCD, an image updater, and a build pipeline, and everything is handled for me without supervision.

Or I could spin up an EC2 instance and run them all on it, but since these applications update once or twice a day, I need a way to have them deployed automatically as soon as code is checked in to the repository. I don't feel like fiddling with webhooks, SNS, and Lambda just to get it to work.

Then I saw AWS Amplify: it can track code, build as soon as code is checked in, and deploy automatically. But damn, it's buggy. I could not get those applications to work 100% on Amplify, for behind-the-scenes reasons I could not understand.

Then I saw ECS with Fargate. It seems promising, but whether I can automate builds and deploys from code to a running container is still questionable. I'm also not sure if there's a cost advantage compared to running a full EKS cluster on spot instances only.

I looked at other providers, like DigitalOcean and Vultr. They offer a managed Kubernetes control plane that costs $0, but damn, their container registries cost a lot more than AWS ECR and have no lifecycle policy to automatically remove old images, which brings the cost very close to what I'd pay doing the same on AWS.

Any ideas on how you would deploy these applications?

https://redd.it/1erbi8r
@r_devops
Traefik global redirect from www to non-www domain

I want to redirect all my containers (websites) from https://www.mywebsite.com to https://mywebsite.com. I already have the HTTP-to-HTTPS redirect. I have set up a CNAME DNS record to point www.mywebsite.com to my server's IP.

I had a discussion with ChatGPT, but what it gave me doesn't work; it just loads https://www.mywebsite.com without an SSL certificate.

Here is my Traefik dynamic.yml configuration, what is missing to make it work? I want to apply this redirect globally in static or dynamic configuration without editing labels for each container.

This does redirect, but the www domain has no HTTPS certificate.

# dynamic configuration

http:
  middlewares:
    redirect-to-non-www:
      redirectRegex:
        regex: "^https?://www\\.(.*)"
        replacement: "https://$1"
        permanent: true

    secureHeaders:
      headers:
        sslRedirect: true
        forceSTSHeader: true
        stsIncludeSubdomains: true
        stsPreload: true
        stsSeconds: 31536000

    user-auth:
      basicAuth:
        users:
          - '{{ env "TRAEFIK_AUTH" }}'

  routers:
    default-router:
      entryPoints:
        - web
        - websecure
      rule: "HostRegexp(`{host:.+}`)"
      middlewares:
        - redirect-to-non-www
        - secureHeaders
        - user-auth
      service: noop-service
      priority: 1

  services:
    noop-service:
      loadBalancer:
        servers:
          - url: "https://0.0.0.0"

tls:
  options:
    default:
      cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
      minVersion: VersionTLS12
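
To see what the `redirectRegex` rule above does to a URL on its own, the same pattern can be tried outside Traefik. This is a sketch in Python with a hypothetical helper name. It only exercises the regex and does not reproduce Traefik's request handling. For an https:// request, the redirect can only be sent after the TLS handshake has completed, so the www hostname still needs a certificate of its own.

```python
import re

# Same expression as the `regex` value in the config, without the YAML escaping.
WWW_PATTERN = re.compile(r"^https?://www\.(.*)")


def redirect_target(url: str):
    """Return the non-www HTTPS URL to redirect to, or None if no redirect applies."""
    match = WWW_PATTERN.match(url)
    if match is None:
        return None
    # Equivalent to the replacement "https://$1" in the config.
    return "https://" + match.group(1)
```

For instance, `redirect_target("http://www.mywebsite.com")` returns `"https://mywebsite.com"`, while a URL without the www prefix returns `None`.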




https://redd.it/1ercmvj
@r_devops
Should I leave?

Hey all, I'm struggling with what to do with regards to my current role.

My main issue is that around a year ago, a lot of the stuff I would have been interested in was abstracted away to managed vendors, from the management of our environments to the management of developer machines.

Anything network-related is handled by either an internal network team or, again, our managed vendor.

As such, there’s actually not much I have direct responsibility for in any meaningful capacity.

I can feel my skills atrophying, and it feels like we’re just secretaries for these other teams, telling them when something is wrong. It really feels like a glorified support role that they slapped the name DevOps engineer on.

We are barely involved in the development process for any new applications and don’t have many opportunities to practice anything.

I’ve been trying to learn in my own time, but it’s hard when you can’t utilise the skills in the workplace.

As someone whose first job out of uni this is, with 3 years in the role, what would you do in my scenario?

https://redd.it/1erf1hm
@r_devops
I built a POC for a real-time log monitoring solution, orchestrated as a distributed system

A proof-of-concept log monitoring solution built with a microservices architecture and containerization, designed to capture logs from a live application acting as the log simulator. This solution delivers actionable insights through dashboards, counters, and detailed metrics based on the generated logs. Think of it as a very lightweight internal tool for monitoring logs in real time. All the core infrastructure (e.g., ECS, ECR, S3, Lambda, CloudWatch, Subnets, VPCs, etc.) is deployed on AWS via Terraform.

Feel free to take a look and give some feedback: https://github.com/akkik04/Trace

https://redd.it/1ergpf0
@r_devops