Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
I am building a new CI tool. What things should I keep in mind?

If I were to build a new CI tool, what are some things I should do that would give me a competitive edge over others?






https://redd.it/1er9cwm
@r_devops
Need to run 4 web applications, each requiring only 0.25 CPU and 500 MB RAM. What's the most economical way on AWS?

I'm looking into options for running 4 web applications, each requiring only 0.25 CPU and 500 MB RAM (or even less). Traffic is fairly low, fewer than 1k active users a month. Each application is just an SPA bundled with a Node backend. These applications also update very frequently (once or twice a day), so each new version needs to go from code to a running application and swap out the old one automatically, without downtime and without supervision.

Sure, I could set up an EKS cluster running solely on spot nodes, with multiple replicas of each app so that a spot termination doesn't cause downtime. But even that would cost me roughly $200 a month (guesstimate). Slap in Argo CD, its image updater, and a build pipeline, and everything is handled for me without supervision.

Or I could spin up a single EC2 instance and run them all on it, but since these applications update once or twice a day, I need a way to deploy them automatically as soon as code is checked in to the repository. I don't feel like fiddling with webhooks, SNS, and Lambda just to get that working.

Then I saw AWS Amplify. It can track code, build as soon as code is checked in, and deploy automatically. But damn, it's buggy: I could not get those applications to work 100% on Amplify, for behind-the-scenes reasons I could not figure out.

Then I saw ECS with Fargate, which seems promising, but I'm still unsure how to automate builds and deploys from code to a running container. I'm also not sure whether there's a cost advantage compared to running a full EKS cluster on spot instances only.
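For a rough sense of the Fargate side of that comparison, the compute cost can be estimated with simple arithmetic. The Python sketch below is only that: the per-hour rates are assumptions (approximately the us-east-1 Linux/x86 on-demand rates, to be checked against the current AWS pricing page), and it ignores load balancer, data transfer, and ECR storage costs.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def fargate_monthly_cost(tasks, vcpu, memory_gb,
                         vcpu_hour_usd=0.04048, gb_hour_usd=0.004445):
    """Estimate monthly Fargate compute cost for always-on tasks.

    The default rates are assumptions, not authoritative prices.
    """
    per_task_hour = vcpu * vcpu_hour_usd + memory_gb * gb_hour_usd
    return tasks * per_task_hour * HOURS_PER_MONTH


if __name__ == "__main__":
    cost = fargate_monthly_cost(tasks=4, vcpu=0.25, memory_gb=0.5)
    print(f"4 tasks at 0.25 vCPU / 0.5 GB: ~${cost:.2f} per month")
```

Under the assumed rates this comes to about $36 a month for the four tasks, before any load balancer or NAT charges, which tend to dominate at this scale.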

I looked at other providers, like DigitalOcean and Vultr. They offer a managed Kubernetes control plane that costs $0, but damn, their container registries cost a lot more than AWS ECR and have no lifecycle policy to automatically remove old images, which brings the cost very close to doing the same on AWS.

Any ideas on how you would deploy these applications?

https://redd.it/1erbi8r
@r_devops
Traefik global redirect from www to non-www domain

I want to redirect all my containers (websites) from https://www.mywebsite.com to https://mywebsite.com. I already have the HTTP-to-HTTPS redirect. I have set up a CNAME DNS record to point www.mywebsite.com to my server's IP.

I discussed this with ChatGPT, but what it gave me doesn't work; it just loads https://www.mywebsite.com without an SSL certificate.

Here is my Traefik dynamic.yml configuration. What is missing to make it work? I want to apply this redirect globally in the static or dynamic configuration, without editing labels for each container.

This does redirect, but the www domain has no HTTPS certificate.

# dynamic configuration

http:
  middlewares:
    redirect-to-non-www:
      redirectRegex:
        regex: "^https?://www\\.(.*)"
        replacement: "https://$1"
        permanent: true

    secureHeaders:
      headers:
        sslRedirect: true
        forceSTSHeader: true
        stsIncludeSubdomains: true
        stsPreload: true
        stsSeconds: 31536000

    user-auth:
      basicAuth:
        users:
          - '{{ env "TRAEFIK_AUTH" }}'

  routers:
    default-router:
      entryPoints:
        - web
        - websecure
      rule: "HostRegexp(`{host:.+}`)"
      middlewares:
        - redirect-to-non-www
        - secureHeaders
        - user-auth
      service: noop-service
      priority: 1

  services:
    noop-service:
      loadBalancer:
        servers:
          - url: "https://0.0.0.0"

tls:
  options:
    default:
      cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
      minVersion: VersionTLS12
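
The rewrite pattern from the `redirect-to-non-www` middleware can be checked on its own, outside Traefik. A minimal Python sketch, assuming Python's `re` module behaves the same as Traefik's Go regex engine for this simple pattern (the `$1` in the replacement becomes `\1` in Python; `strip_www` is a hypothetical helper name). It only verifies the URL rewrite. It says nothing about the missing certificate, because the TLS handshake for the www host happens before any redirect middleware runs.

```python
import re

# Same pattern as the middleware, after YAML unescapes "\\." to "\."
WWW_PATTERN = re.compile(r"^https?://www\.(.*)")


def strip_www(url: str) -> str:
    """Rewrite http(s)://www.<rest> to https://<rest>; leave other URLs untouched."""
    return WWW_PATTERN.sub(r"https://\1", url, count=1)


if __name__ == "__main__":
    for url in ("https://www.mywebsite.com/about",
                "http://www.mywebsite.com",
                "https://mywebsite.com"):
        print(url, "->", strip_www(url))
```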




https://redd.it/1ercmvj
@r_devops
Should I leave?

Hey all, I'm struggling with what to do about my current role.

My main issue is that around a year ago, a lot of the stuff I would have been interested in was handed off to managed vendors, from the management of our environments to the management of developer machines.

Anything network-related is handled either by an internal network team or, again, by our managed vendor.

As such, there's actually not much I have direct responsibility for in any meaningful capacity.

I can feel my skills atrophying, and it feels like we're just secretaries who tell these other teams when something is wrong. It really feels like a glorified support role they slapped the name "DevOps engineer" on.

We are barely involved in the development process for any new applications and don't have many opportunities to practice anything.

I've been trying to learn in my own time, but it's hard when you can't use the skills in the workplace.

This is my first job out of uni and I've been in the role for 3 years. In my scenario, what would you do?

https://redd.it/1erf1hm
@r_devops
I built a POC for a real-time log monitoring solution, orchestrated as a distributed system

It's a proof-of-concept log monitoring solution built with a microservices architecture and containerization, designed to capture logs from a live application acting as the log simulator. The solution delivers actionable insights through dashboards, counters, and detailed metrics based on the generated logs. Think of it as a very lightweight internal tool for monitoring logs in real time. All the core infrastructure (ECS, ECR, S3, Lambda, CloudWatch, subnets, VPCs, etc.) is deployed on AWS via Terraform.

Feel free to take a look and give some feedback: https://github.com/akkik04/Trace

https://redd.it/1ergpf0
@r_devops
API Observability Guide: Enhancing Reliability & Performance


One of these guest blogs did a pretty good job of covering API observability: what it is, its pillars and components, and how to implement it. It also covers a few advanced techniques, so I thought it might be good to share here as an educational resource.

Any additional techniques that we may have missed are welcome, but no pressure.
https://www.getambassador.io/blog/api-observability-enhancing-reliability-performance

https://redd.it/1erghrs
@r_devops
Why is this happening

I suddenly started facing this problem when pressing Run Java on my Spring Boot app. If any of you beautiful souls have faced it before, how did you work around it? I have a deadline and need to fix this quickly, sorry.

The problem:
Failed to refresh live data from process

service:jmx:rmi:///jndi/rmi://127.0.0.1:45556/jmxrmi

after retries: 10

Source: Spring Boot Tools

https://redd.it/1erj3j1
@r_devops
DevOps lessons from building a global monitoring platform

Ever start a side project that spirals out of control? That's the story of my last year building UptimeCard, and I thought I'd share some DevOps war stories with you all.

It began innocently enough - just a simple uptime monitor. Fast forward, and I'm juggling a platform that's analyzing tech stacks for thousands of websites globally.

The first reality check hit when my cute little DigitalOcean setup choked at around 1000 monitored sites. Suddenly, I'm deep-diving into AWS documentation, trying to figure out how to scale this thing without breaking the bank. EC2, Lambda, DynamoDB - my new best friends and worst nightmares.

But here's the kicker - monitoring globally means dealing with, well, the globe. I naively thought I could run everything from a single region. You can't.

Then came the data deluge. Turns out, collecting and processing data from thousands of sites every minute is like drinking from a fire hose. I cobbled together a pipeline with Kinesis, and it's holding... for now.

Oh, and the irony of needing rock-solid monitoring for a monitoring service? Not lost on me. I've got CloudWatch alerts that would wake the dead. Because nothing says "professional" like your uptime monitor going down.

Infrastructure management became my nemesis. Started with manual setups (I know, I know), and quickly drowned in config hell. Terraform saved my sanity, but the migration was... let's call it character-building.

Security? A constant paranoia. When you're handling data from thousands of websites, every shadow looks like a potential breach. I'm now on a first-name basis with AWS's IAM documentation.

And let's not forget the cloud bill. I'm now a reluctant expert in auto-scaling groups and spot instances.

UptimeCard's at v1.0 now (https://uptimecard.com if you're curious), but it feels like I've aged a decade getting here. I'm sure there's still a ton to optimize.

So, what hard-learned lessons have you picked up from similar projects? Any tips for a battle-worn developer still figuring out this DevOps game?

I'm also toying with the idea of open-sourcing some of our DevOps scripts. Feels like it's time to give back to the community that's saved my bacon more times than I can count.

https://redd.it/1erjp83
@r_devops
Need Suggestions for Reducing Downtime During EKS Deployments

Hello everyone,

I could use some help or suggestions with a deployment issue we're facing.

Currently, we're deploying to EKS, using Atlas MongoDB, and storing some documents in S3. The challenge is that every time we deploy to production, we need to take the system offline, back up S3 (which takes about an hour due to a large number of files, even though the size is small), back up the database, then deploy and run the migration.

Does anyone have ideas on how we can reduce or eliminate this downtime?

https://redd.it/1erjuji
@r_devops
Resources to learn DevOps Project

Hi all,

Hoping you wonderful people can help.

I'm a project manager who moved into product management.

At present, I am the product owner for Dynamics 365. One of the core issues we have faced is our single-branch strategy. I'm currently in the process of moving us fully onto Azure DevOps so we can automate testing and fix the branching strategy, allowing us to be more agile.

One area where I need help is understanding how to use Azure Boards and the Delivery Plans section in Azure DevOps.

Does anyone know any good, free content for me and my BAs to learn this?



https://redd.it/1erixho
@r_devops
What do you monitor on your servers?


We've been developing the BlueWave Uptime Manager for the past 5 months with a team of 7 developers and 3 contributors. As we move towards expanding from basic uptime tracking to a comprehensive monitoring solution, we're interested in getting insights from the community.

For those of you managing server infrastructure,

What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
Do you also keep tabs on network performance, processes, services, or other metrics?

Additionally, we're debating whether to build a custom monitoring agent or leverage existing solutions like OpenTelemetry or Fluentd.

What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
Lastly, what’s your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?

https://redd.it/1erkhef
@r_devops
Exploring the 12-Factor App Methodology: A Blueprint for Building Scalable and Resilient Cloud-Native Applications

Hey everyone,

I wanted to share a comprehensive blog post I just published about the **12-Factor App methodology**—a set of best practices designed to help developers build scalable, maintainable, and resilient cloud-native applications.

If you're working with **DevOps**, **microservices**, or building applications that need to thrive in **cloud environments**, understanding and applying these 12 factors can be a game-changer. In the post, I dive deep into each principle, explaining how they contribute to building modern, robust applications. I've also included book recommendations for each factor to help you explore these concepts further.

**What you’ll find in the blog:**

* An overview of all 12 factors, from codebase management to treating logs as event streams
* Practical insights on how to implement these principles in your projects
* Book recommendations to deepen your understanding of each factor

If you're interested in improving your application development practices, I think you'll find this post valuable.

🔗 [https://medium.com/@srivatssan/the-12-factor-app-methodology-a-blueprint-for-modern-cloud-native-applications-c1aea2984bde?sk=e2e214a30f30be4dfe7495b5fc27c80a](https://medium.com/@srivatssan/the-12-factor-app-methodology-a-blueprint-for-modern-cloud-native-applications-c1aea2984bde?sk=e2e214a30f30be4dfe7495b5fc27c80a)



I'd love to hear your thoughts and any experiences you've had implementing the 12-Factor App principles in your work!

https://redd.it/1erthxd
@r_devops
What is the best way to monitor the health of a lot of PCs?

My workplace has a lot of lab systems that occasionally lose their Wi-Fi connection and go offline. What is the best way to monitor multiple PCs? I would like to monitor network connectivity and available hard disk space.
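
As a starting point before picking a full monitoring tool, both checks can be done with the Python standard library. This is a minimal sketch, not a monitoring system: the function names, the target host and port, and the idea of running it on a schedule and shipping the result somewhere central are all assumptions.

```python
import shutil
import socket


def disk_free_percent(path="."):
    """Return the percentage of free space on the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_report(path, host, port):
    """Collect both checks into one dict that could be logged or sent elsewhere."""
    return {
        "disk_free_percent": round(disk_free_percent(path), 1),
        "network_ok": can_connect(host, port),
    }


if __name__ == "__main__":
    # Hypothetical target: any host and port the lab PCs should always reach.
    print(health_report(".", "192.168.1.1", 80))
```

Note that a PC that has dropped off Wi-Fi cannot report its own failure, so a central server would also need to treat a missing report as "offline".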



https://redd.it/1eru418
@r_devops
Where and how do you store your environment vars / secrets?

Right now we are storing the env vars/secrets in Bitbucket (secrets are pulled and mounted).

Looking for better options.

I found a few options, such as HCP Vault or AWS SSM Parameter Store. But as a beginner, I'm stumped on how it is actually done.
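
To give a rough idea of how the Parameter Store option looks from application code, here is a minimal Python sketch. It assumes the `boto3` SSM client and its `get_parameter` call with `WithDecryption=True`; the parameter name `/myapp/prod/DB_PASSWORD` and the `FakeSSM` stand-in are hypothetical and exist only so the example runs without AWS access.

```python
import os


def load_secrets(ssm_client, names):
    """Fetch parameters by full name and return {last path segment: value}."""
    secrets = {}
    for name in names:
        response = ssm_client.get_parameter(Name=name, WithDecryption=True)
        key = name.rsplit("/", 1)[-1]
        secrets[key] = response["Parameter"]["Value"]
    return secrets


def export_to_env(secrets):
    """Expose the fetched values to the process as environment variables."""
    os.environ.update(secrets)


class FakeSSM:
    """Hypothetical stand-in that mimics the shape of the boto3 SSM response."""

    def __init__(self, store):
        self._store = store

    def get_parameter(self, Name, WithDecryption=False):
        return {"Parameter": {"Name": Name, "Value": self._store[Name]}}


if __name__ == "__main__":
    # With real credentials this would be: client = boto3.client("ssm")
    client = FakeSSM({"/myapp/prod/DB_PASSWORD": "example-only"})
    export_to_env(load_secrets(client, ["/myapp/prod/DB_PASSWORD"]))
    print("DB_PASSWORD" in os.environ)
```

The point of the design is that the secret never lives in the repository or pipeline variables; the running workload fetches it at startup using its IAM role.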

https://redd.it/1erw27o
@r_devops
Aurora (MySQL) global database with global write forwarding.

We are using Aurora MySQL Global Database (east primary and west secondary). We have logic in the gateway to route "read" traffic by geography and "write" traffic by weight, i.e. to east.

Question: Do you recommend using global write forwarding instead? Our application is read-heavy, if that matters, and we do need performance (plus consistency; I know you can't have it all, so maybe performance over consistency with a lag of ~ milliseconds).

Some blogs I've read say not to use global write forwarding. Is the gateway-based routing we have good enough, even though it's not truly active/active for our application in that case? Or should we do code-based routing instead, i.e. send read queries to geo-routed endpoints and write queries to weighted routes (Spring/JPA)?
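
The code-based routing idea can be sketched in a few lines. This is an illustrative Python sketch only (the post's stack is Spring/JPA, where the same decision is usually made inside a routing data source); the endpoint names and region keys are hypothetical, and classifying queries by their first keyword is a simplification that misses cases such as `SELECT ... FOR UPDATE`.

```python
READ_PREFIXES = ("select", "show", "describe", "explain")


def is_read_query(sql):
    """Crude classification: treat a statement as a read if it starts with a read keyword."""
    return sql.lstrip().lower().startswith(READ_PREFIXES)


def pick_endpoint(sql, client_region, endpoints, primary_region):
    """Send reads to the caller's regional reader, everything else to the primary writer."""
    if is_read_query(sql):
        # Fall back to the primary region's reader if the caller's region has none.
        regional = endpoints.get(client_region, endpoints[primary_region])
        return regional["reader"]
    return endpoints[primary_region]["writer"]


if __name__ == "__main__":
    # Hypothetical endpoint map for an east-primary, west-secondary setup.
    endpoints = {
        "us-east-1": {"writer": "east-writer.example", "reader": "east-reader.example"},
        "us-west-2": {"reader": "west-reader.example"},
    }
    print(pick_endpoint("SELECT * FROM orders", "us-west-2", endpoints, "us-east-1"))
    print(pick_endpoint("UPDATE orders SET paid = 1", "us-west-2", endpoints, "us-east-1"))
```

Whichever layer does the routing, a read issued right after a write may hit a lagging secondary, so read-after-write paths may need to be pinned to the primary.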

Any suggestions or how you have implemented it would be helpful, thanks!

https://redd.it/1erwraz
@r_devops
CI/CD observability

Is your CI/CD pipeline slowing you down? Dive into the key steps and best practices to enhance your pipeline's visibility and performance using OpenTelemetry. Check out this blog: https://www.cloudraft.io/blog/opentelemetry-for-cicd-observability

https://redd.it/1ery0u3
@r_devops
Loggly alternative for centralized logs

I'm looking for an alternative to Loggly. I have various .NET applications deployed across multiple locations, and I need them to send their logs back to a central server.

I've been experimenting with Loggly and I'm already at the limit of their free plan, even in the testing phase. I was thinking about Splunk, since it offers the feature set most similar to Loggly's, but it comes with significant limitations on data ingestion, especially in the Splunk Light version.

Does anyone have any recommendations? :)

https://redd.it/1ery93u
@r_devops
I started challenging our junior devs to provide feedback or ask at least one question while reviewing a PR. Thoughts?

Our junior devs are allowed to approve PRs (not my choice), and it's usually just a rubber stamp because they're nervous about calling out a more senior member.

I requested they try to add something to the PR in terms of feedback just to help them get their feet wet and more comfortable.



https://redd.it/1es2ykc
@r_devops
We're reviewing a few CI/CD tools for our company and I'm curious about your experience with a couple.

Namely, it looks like management is whittling it down to Travis CI or GitHub Actions. I've heard that GitHub Actions requires a lot more coding than Travis (this is a lot more important to me than to the bean counters lol). If that's the case, it sounds like there's a big efficiency argument there that may not be so easy to quantify for the various decision makers. Anyone?

https://redd.it/1es4h0a
@r_devops
Standard vs Express Step Functions

I don't quite understand what they mean by the exactly-once and at-least-once models, respectively. If we can use a for loop and retry in a Standard workflow, how is that exactly-once then?!
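
One way to read the distinction: the guarantee is about how many times the service itself may run a workflow or step, not about retries you configure explicitly. Under an at-least-once model, a step may be invoked more than once for the same input, so handlers are usually written to be idempotent. A minimal Python sketch of that idea follows; the class and field names are hypothetical, and the in-memory dict stands in for a durable store.

```python
class IdempotentHandler:
    """Process each request ID once, even if the same request is delivered again."""

    def __init__(self):
        self._results = {}     # request_id -> stored result (stand-in for a durable store)
        self.side_effects = 0  # counts how many times the real work actually ran

    def handle(self, request_id, amount):
        # A duplicate delivery returns the stored result instead of redoing the work.
        if request_id in self._results:
            return self._results[request_id]
        self.side_effects += 1
        result = {"request_id": request_id, "charged": amount}
        self._results[request_id] = result
        return result


if __name__ == "__main__":
    handler = IdempotentHandler()
    handler.handle("order-1", 25)
    handler.handle("order-1", 25)  # simulated duplicate delivery
    print(handler.side_effects)
```

With a handler like this, a duplicate invocation is harmless, which is why at-least-once execution is considered acceptable for idempotent, high-volume workloads.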

https://redd.it/1es5brj
@r_devops
In your resume, do you put in a lot of keywords to pass CV screening, or avoid them?

Hello!

In your resume, in order to pass the CV screening phase, which is often done by HR or even an automated tool, do you put in a lot of technology keywords? (Like listing all the tech you've worked on, even if it was only for a short time.)

Or do you avoid that in order to pass the hiring manager's CV screening?

What is the good balance?

https://redd.it/1es72ff
@r_devops