Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
DevOps engineer here, unsure about the future

Hi Everyone,



I’ve been working in the DevOps field for about four years, focusing on tools such as Jenkins, Terraform, Kubernetes, and Docker, primarily within Google Cloud Platform. As I look to expand my skill set, I’m considering exploring new areas such as security or data. I’m interested in hearing your thoughts on which direction might be most beneficial for future growth and how best to get started. Any suggestions or advice would be greatly appreciated!



Thank you!

https://redd.it/1ekifox
@r_devops
How do I use Tetragon to get notifications when someone performs certain actions in our environment?

When I started testing Tetragon, I imagined I'd be able to get alerts when someone kubectl exec'ed into a pod and ran commands, but it seems it's not that straightforward.

Tetragon exposes a few metrics that I thought would help, such as tetragon_events_total and tetragon_policy_events_total, but neither provides any information about which command was executed.

For example, following their setup docs, I ran cat /etc/shadow, which got a SIGKILL. That event shows up in the metrics above, but I don't see how I can use this information to get alerts.

Am I doing this wrong? How did you implement this or a similar eBPF tool in your environment?
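
One possible direction, sketched under assumptions rather than taken from the Tetragon docs: the metrics only count events, so the command itself would have to come from the per-event JSON records instead. The sketch below assumes each record is one JSON object per line with a top-level process_exec key whose process object carries binary, arguments and pod fields. The watch list and the sample record are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator

# Hypothetical watch list: alert when one of these binaries touches this path.
WATCHED_BINARIES = {"/usr/bin/cat", "/bin/cat"}
WATCHED_ARGUMENT = "/etc/shadow"


def find_suspicious_execs(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield a small alert record for every matching process_exec event."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON record
        if not isinstance(event, dict):
            continue
        # Assumed layout: {"process_exec": {"process": {...}}, "time": "..."}
        process = (event.get("process_exec") or {}).get("process")
        if not process:
            continue
        binary = process.get("binary", "")
        arguments = process.get("arguments", "")
        if binary in WATCHED_BINARIES and WATCHED_ARGUMENT in arguments:
            pod = process.get("pod") or {}
            yield {
                "time": event.get("time"),
                "namespace": pod.get("namespace"),
                "pod": pod.get("name"),
                "command": f"{binary} {arguments}".strip(),
            }


# Hypothetical sample records standing in for an exported event stream.
sample = [
    json.dumps({
        "process_exec": {
            "process": {
                "binary": "/usr/bin/cat",
                "arguments": "/etc/shadow",
                "pod": {"namespace": "default", "name": "xwing"},
            }
        },
        "time": "2024-08-05T10:00:00Z",
    }),
    json.dumps({"process_exec": {"process": {"binary": "/usr/bin/ls", "arguments": "/tmp"}}}),
    "not json at all",
]

alerts = list(find_suspicious_execs(sample))
for alert in alerts:
    print(alert)
```

In practice the yielded records would be forwarded to whatever alerting channel is already in place, and the same filtering could just as well live in a log pipeline instead of a script.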

https://redd.it/1ekj0cq
@r_devops
How Do You Prefer to Use a CLI Tool?

Hey everyone!

I just made a migration tool that helps you move from Nexus and Artifactory to my new platform, RepoFlow. It's all in TypeScript, and I’m trying to figure out the best way to make it available for everyone.

How do you like to install CLI tools? Would you prefer:

An npm package?
A Docker image?
A yum package?
Or should I just open-source it and let you run it straight from the code?

https://redd.it/1ekl36t
@r_devops
MMORPG/Games streaming architecture help

Hi all,


Lately I have been curious about what the infrastructure of MMO games and game streaming services like Xbox Game Pass looks like under the hood: how sessions are managed, how servers are provisioned and scaled, and so on. Unfortunately, I have found little to no reference architecture on the subject. Do you know of any good references, projects, books, or articles I could look into to get a better view of how they work?

Thanks in advance!

https://redd.it/1ekkz15
@r_devops
Junior fullstack developer -> appsec or devops?

Hello, I was wondering what a more natural career progression is for a junior fullstack developer working on a web app. As part of my job I have very limited interaction with the CI/CD pipeline in Azure DevOps, and I am curious to learn more about it.

This got me a little interested in DevOps, and I was wondering whether it is a natural career progression to take. I am also curious about AppSec, since I'm interested in cybersecurity and do reverse engineering as a hobby (though not reverse engineering malware or anything like that), and I was told that is a valuable skill for AppSec.

As a junior fullstack web developer, what would be the more natural, or even more lucrative, progression for someone interested in both DevOps and AppSec? I imagine I only have time to go in one direction, right?

https://redd.it/1ekmqt6
@r_devops
What is the best Git branching strategy for managing Ansible CIS (hardening) roles?

We currently have one AWX server and a GitLab instance in our environment to develop and test automation. I was tasked with testing the roles as a proof of concept for multiple OSs and applications (MS SQL servers, web, RHEL 7-9, etc.). Once we knew the roles worked and we were satisfied with our compliance results, our lead said that we needed to build an automated testing process to ensure code quality. We ended up building something that works in theory but would probably be a disaster to manage in practice unless I can guarantee that our pipeline process is rigidly enforced.

To manage inventory, they put each environment/OS type combination in its own file. For example, we have a Dev<type>Server.yml, Test<type>Server.yml, and Prod<type>Server.yml, and the same pattern of .yml files in this one repository for any other type of server (RHEL, SQL, etc.) you can think of. Why? We thought we could not keep the inventory file in the same repository as the role, because we have three separate branches, one for each environment. So now I am able to keep each hardening deployment separated, because there is a .CI file that essentially forces an upstream code promotion pattern: commits are made, linted, tested in the corresponding environment, and merged to the next branch.

But there is literally an inventory file for each environment per OS/server function type, living in a separate repository. Each inventory file corresponds to an inventory object in AWX, which we tie to a job template. When a developer makes a commit to the development branch in the role’s repository, we lint the commit and then call AWX’s API to launch the development job template. If the development job template runs successfully in AWX, the pipeline creates an MR and randomly assigns it to some reviewers so we can build an audit trail. The next merge then restarts the same .CI process for the upstream environments.
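
The promotion flow described above can be sketched as a small lookup, assuming three branches named development, test and production (only the development branch is named in the post; the other names, the job template naming scheme and the function names are hypothetical). It only models the decision logic and makes no AWX or GitLab API calls.

```python
from typing import List, Optional

# Assumed promotion order; only "development" is named in the post.
PROMOTION_ORDER = ["development", "test", "production"]


def job_template_for(branch: str, server_type: str) -> str:
    """Hypothetical naming scheme: one job template per environment and server type."""
    return f"{server_type}-hardening-{branch}"


def next_branch(branch: str) -> Optional[str]:
    """Return the branch a successful run is promoted to, or None at the end."""
    index = PROMOTION_ORDER.index(branch)
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None


def plan_pipeline(branch: str, server_type: str, job_succeeded: bool) -> List[str]:
    """List the steps the pipeline would take for a commit on the given branch."""
    steps = [
        f"lint commit on {branch}",
        f"launch job template {job_template_for(branch, server_type)}",
    ]
    if not job_succeeded:
        steps.append("stop: job failed, no merge request")
        return steps
    target = next_branch(branch)
    if target is None:
        steps.append("done: nothing left to promote to")
    else:
        steps.append(f"open merge request {branch} -> {target}")
    return steps


for step in plan_pipeline("development", "rhel", job_succeeded=True):
    print(step)
```

Deriving the template name from the branch and server type, rather than maintaining a hand-written entry per combination, is one way to keep the number of moving parts visible as more environments and server types are added.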

This works fine in theory, but I foresee us ending up with TONS of job templates on our Ansible server for the same role, one per environment. I am also wondering how we are going to treat each application’s hardening process differently. For example, I think all application teams who use RHEL servers should start from a golden hardened image before they even build their app on top, because we are starting to see issues when we harden a system that belongs to another team and they say the server is unreachable or something breaks. Having a separate version of the role for each team to satisfy each application sounds horribly unmanageable. I just don’t see how I can maintain separate environments, for each server type, FOR EACH SEPARATE TEAM.

https://redd.it/1eknzsl
@r_devops
Greetings fellow newly unemployed people. How can we apply to jobs more efficiently?

A lot of the popular auto-complete forms are absolute trash. There must be a better way.

https://redd.it/1ekoef4
@r_devops
Branching strategy and environments.

I'm a little confused about how branching strategies relate to development, testing, and production environments. Can someone explain how they do it in practice?

https://redd.it/1ekq3de
@r_devops
Supercharge Monorepo CI/CD: Unlock Selective Builds

Hey DevOps community,

I've been battling with slow CI/CD pipelines in our monorepo setup for months, and I finally found a solution that's been a game-changer for us. Thought I'd share in case anyone else is pulling their hair out over this.

TL;DR: Implemented selective builds in our monorepo, and it's cut our build costs by ~70%.

I wrote up a detailed guide on how we did it, including:

- The concept behind selective builds
- How to implement it using GitHub Actions and Redis
- Code snippets and real-world examples
- Pitfalls we encountered and how to avoid them
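
As a rough illustration of the selective-build idea only (this is not the GitHub Actions and Redis implementation from the linked guide), the core decision can be sketched as mapping changed file paths to the services that own them. The directory layout, the service names and the shared-library rule are all hypothetical; the list of changed files is assumed to come from something like git diff --name-only.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Set

# Hypothetical monorepo layout: service name -> directory that owns it.
SERVICE_ROOTS: Dict[str, str] = {
    "auth": "services/auth",
    "billing": "services/billing",
    "web": "apps/web",
}

# Hypothetical shared code: a change here rebuilds every service.
SHARED_ROOTS: List[str] = ["libs/common"]


def _is_under(path: PurePosixPath, root: str) -> bool:
    """True if path is the root directory itself or lives somewhere below it."""
    root_path = PurePosixPath(root)
    return path == root_path or root_path in path.parents


def services_to_build(changed_files: Iterable[str]) -> Set[str]:
    """Return the services affected by the changed files."""
    affected: Set[str] = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        if any(_is_under(path, root) for root in SHARED_ROOTS):
            return set(SERVICE_ROOTS)  # shared code changed: build everything
        for name, root in SERVICE_ROOTS.items():
            if _is_under(path, root):
                affected.add(name)
    return affected


print(sorted(services_to_build(["services/auth/main.py", "README.md"])))
```

Files that belong to no service, such as a top-level README, select nothing, which is where the savings come from: a docs-only change triggers no builds at all.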

It's not a silver bullet, but it's made a huge difference for our team. If you're dealing with monorepo headaches, especially in larger codebases, you might find this useful.

https://developer-friendly.blog/2024/08/05/supercharge-monorepo-cicd-unlock-selective-builds/

Happy to answer any questions or hear about your own monorepo war stories. What's worked (or spectacularly failed) for you?

https://redd.it/1ekszx8
@r_devops
Noob here. Should I build my project source code into an executable in my Dockerfile? Or should I copy the executable from host machine into container directly?

I am asking because I want to know what the best practice is and, most importantly, why.

1) Copy source code into the image and build the program executable there
2) Copy the executable directly from the host machine into the image (skip build)

Thanks!

https://redd.it/1ekrnag
@r_devops
A Blockchain ETL and efficient data pipeline management webinar

Blockchain ETL has unique challenges for DevOps teams managing data pipelines. This webinar explores practical solutions and best practices for handling blockchain data at scale.

Webinar: Optimizing DevOps for Blockchain ETL Pipelines

Date: August 8th, 12 PM EDT

Topics:

1. Blockchain data architecture for high-throughput systems
2. Containerization and orchestration strategies for blockchain nodes
3. Monitoring and alerting for blockchain-specific metrics
4. CI/CD pipelines for blockchain data services
5. Live demo: Real-time blockchain data synchronization and indexing

Speakers:

Andrei Terentiev, CTO of Bitcoin.com (https://Bitcoin.com)
Seb Melendez, ETL Software Engineer at Artemis

Key takeaways:

Strategies for maintaining data consistency across distributed ledgers
Performance tuning for blockchain data ingestion and processing
Security considerations in blockchain data pipelines
Q&A session addressing DevOps-specific blockchain challenges

Target audience: DevOps engineers, SREs, and technical leads working with blockchain infrastructure

Registration: Webinar Registration Link

https://redd.it/1ekusu0
@r_devops
RESUME REVIEW

Hello Everyone,

I need some feedback on my resume. I created it with a specific focus on achievements and improvements at the product/business level.

In particular, I need serious suggestions for point number 3 under the work experience section. I want to highlight my achievement of adding KEDA to the entire data warehouse pipeline, which significantly improved data processing efficiency. However, I'm struggling with how to word this effectively as an achievement in 2 lines to match the theme of the overall resume.

If you have any suggestions, please share them as they will help me a lot.

Thanks!



Resume: https://imgur.com/a/ec9Gptt

https://redd.it/1ekzo6c
@r_devops
New boss says I should be OK with being on call every other week

Had an interesting conversation with my new boss today that I'd love to get some perspective on. I work on a two-person DevOps team supporting an application that plays a critical role for some fairly large players in the transportation industry. The application has SLAs with associated financial penalties and, to be honest, I think our customers expect that we have invested more in our operational capabilities than we actually have, considering how little revenue we make a year from the whole thing.

Currently, a junior engineer and I split an on-call rotation that I set up 'voluntarily'. Previously, our alerts just came in as emails or via SNS, which obviously wasn't effective, so, not having an easy way to get phone alerts, I set up a free PagerDuty account. Thus began our 26 weeks each of 'official' on-call a year. I am also the escalation point, so functionally speaking I've been on call 24/7/365 for the last few years. This has led to some pretty great uptime compared to what things looked like before, but I never had a formal conversation about what should be expected of me with regard to on-call.

This past Saturday, we had an issue where a pet reporting service (Jasper Reporting Server, biggest pain in the ass ever, I do not recommend) that had recently been updated to a new version became unresponsive due to a thread issue, and unfortunately it was not detected before a support ticket was raised. My co-worker wasn't available when support contacted her, and I was out for a walk and didn't have my phone, so users were unable to generate reports for about 3 hours until I was back home.

This incident prompted a retrospective today, where I raised the point that we need an incident response strategy for these types of situations, because it is unreasonable to expect two people to split an on-call rotation like this and tell our transportation customers that we're taking incident response seriously. I personally want to open up the on-call rotation to the development team as well and roll out some runbook automation for common tasks (such as restarting a service, although my boss was incredulous that I'd have to train people to do this). I can still be an escalation point, but I don't need to be, and cannot be, on call 24/7.

My boss responded with what I perceived to be a kind of shitty comment: two people managed the DevOps program at his previous job, and being on call, even every other week or all the time, isn't that big of a deal. It was kind of a shitty comment because the way it was said implied that we're lesser than the two people he worked with previously, that being lesser engineers is why we have more operational issues, and that the only reason we don't like on-call is our own problems. There was a lot to unpack in that statement, especially given that I am on a team with a non-existent tooling budget, but whatever, I won't get sour over some difficult talk after what was basically an undetected service outage.

However, I do not agree with his position that being on call every other week is acceptable. Having to plan to have a laptop with me is a non-trivial thing, and the stress of knowing I could get an alert while I'm out at dinner is a lot, even if I don't get 'that many' alerts. I'm curious what other people's thoughts are on frequent on-call for small teams.

It's probably time for me to move on (I've already wasted too much time not learning Kubernetes), but I wasn't sure if I was overreacting to his position on on-call because of the perceived slight.

TL;DR: Is expecting someone to take an on-call rotation every other week reasonable, given that they're on a two-person team with one person being significantly more junior?


*edit* we are not compensated for on call hours worked outside of our yearly salaries

https://redd.it/1el1bfq
@r_devops
Flyway with Jenkins

Has anybody here tried using this stack before? How was your experience? Does anyone have a use case I can use as a reference? We are currently trying out Flyway to see whether we can adopt it in our dev environment and whether we should get the subscription. Any insight is appreciated, thanks.

https://redd.it/1el21aa
@r_devops
Configure EC2 in GitHub Actions workflow via SSH or use Ansible?

I'm working on a GitHub Actions workflow, part of which deploys an AWS EC2 instance via Terraform. To configure the EC2 instance for a Node.js application, I could theoretically SSH in or run commands remotely on the instance from the workflow, but is there an advantage to running an Ansible playbook from the Actions workflow instead? One reason that may favor Ansible: it increases the modularity of the pipeline, meaning I could more easily port it to another workflow or even another CI/CD platform (Jenkins, etc.), as the Ansible playbook is agnostic to the CI/CD platform on which it runs. Any other thoughts?

https://redd.it/1el1ryf
@r_devops
Careers after DevOps - experience or suggestions?

Awful economy and a stupidly wide range of roles within "DevOps Engineer" that are almost impossible to fulfil. So what are good exit careers after DevOps?

obviously development (if your programming skills are up to scratch)
what else?



https://redd.it/1elav9p
@r_devops
How OpenAI Scaled Kubernetes to 7,500 Nodes by Removing One Plugin

Hi everyone. I recently read an article about how OpenAI scaled Kubernetes to 7,500 nodes.

There was a lot of information in there but I thought the most important part was how they replaced Flannel with Azure CNI.

So I spent a lot of hours doing a bit more research into the specifics and here are my takeaways:

• Flannel is a Container Network Interface (CNI) plugin that is perfect for pod-to-pod communication between nodes

• Flannel works well for smaller clusters, but it was not designed for thousands of nodes

• Flannel's performance got worse with the increased node count because of things like route table creation and traffic routing

• OpenAI already hosted its infrastructure on Azure and used the Azure Kubernetes Service (AKS)

• They switched from Flannel to Azure CNI, which is specifically designed for AKS

• Azure CNI is different from Flannel in several ways which made it a better solution for OpenAI

• The switch to Azure CNI ended up making pod-to-pod communication a lot faster

Okay, this is a super basic summary, but if you want a more detailed explanation with nice visuals, check out the full article.

https://redd.it/1eld525
@r_devops
What Python Frameworks do you use?

I was using the search feature and was surprised not to see a question raised about this. What frameworks should you learn as a DevOps engineer, and what modules do you use? I know for a fact that everyone should learn to import csv, or even Flask / FastAPI.

What do you all use / think everyone should know how to use even on a basic level?

https://redd.it/1elgr21
@r_devops
Pull request branch auto-pull on target branch update

I haven't done much DevOps in my life and need some advice on an issue I am facing. I couldn't find anything close to what I needed; either I missed it or I didn't know how to phrase my question.

In my team, we tend to have 15-20+ open pull requests at a time, and it's quite bothersome that whenever one gets merged, the TL refuses to review any of the others until they are up to date with the target branch.

As you can imagine, it gets annoying, and because the issue couldn't be solved by having them review the PR anyway, even if it's a couple of commits behind, I thought I would solve it technically.


Here is what I could stitch together as a CI-CD step:



update_branches:
  stage: update-branches
  script:
    - git fetch --all
    # Work out which branch the pipeline's commit belongs to.
    - TARGET_BRANCH=$(git branch --contains $CI_COMMIT_SHA | sed -n 's/^* //p')
    - |
      # Merge the target branch into every other remote branch.
      for branch in $(git branch -r | grep -v '\->' | grep -v "$TARGET_BRANCH" | sed 's/ *origin\///'); do
        git checkout $branch
        if git merge origin/$TARGET_BRANCH; then
          git push origin $branch
        else
          echo "Merge conflict in $branch. Resolve conflicts manually."
          # Clean up the failed merge so the next checkout works.
          git merge --abort
        fi
      done



I would love any advice. Please tell me if this is bad practice, how I could approach it another way, and what other options I have.

https://redd.it/1eligfi
@r_devops
Blue/Green on Internal Service Microservice

Hi all, for those running a microservices environment who are able to perform blue/green deployments on an individual microservice basis: how exactly are you achieving this for an API service that is consumed only by another microservice (and does not have a front end)?

Suppose the following traffic flow in AWS.

Client desktop browser -> ALB -> microservice_1 -> ALB -> microservice_2 -> ALB -> microservice_3

Suppose I want to perform blue/green on microservice_2. I create another target group (blue) for microservice_2 and keep traffic pointing to green. I can now hit microservice_2 blue directly from some other machine and run a suite of smoke tests. That said, I also want to validate the end-to-end flow from the client desktop to microservice_3, using microservice_2 blue.

I imagine this would require some mechanism like an HTTP cookie (use_microservice_2_dark) that each of the intermediary devices would have to pass along through every hop, but I might be overthinking it.
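
The cookie idea can be sketched as follows, under loud assumptions: the upstream URLs are hypothetical, the cookie is treated as active when its value is "true", and headers are modelled as a plain dict with a "Cookie" key. The sketch only shows the two things every hop would have to do: pick blue or green based on the cookie, and forward the cookie unchanged so the next hop can make the same decision.

```python
from typing import Dict, Tuple

DARK_COOKIE = "use_microservice_2_dark"  # cookie name taken from the post

# Hypothetical addresses for the two target groups of microservice_2.
UPSTREAMS: Dict[str, str] = {
    "green": "http://microservice-2-green.internal",
    "blue": "http://microservice-2-blue.internal",
}


def parse_cookies(header: str) -> Dict[str, str]:
    """Split a Cookie header such as 'a=1; b=2' into a name -> value dict."""
    cookies: Dict[str, str] = {}
    for part in header.split(";"):
        name, _, value = part.strip().partition("=")
        if name:
            cookies[name] = value
    return cookies


def route_request(incoming_headers: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return the upstream to call and the headers to send along with the call."""
    cookie_header = incoming_headers.get("Cookie", "")
    cookies = parse_cookies(cookie_header)
    # Assumption: the value "true" opts the request into the blue deployment.
    colour = "blue" if cookies.get(DARK_COOKIE) == "true" else "green"
    outgoing: Dict[str, str] = {}
    if cookie_header:
        # Pass the cookie through untouched so later hops see the same flag.
        outgoing["Cookie"] = cookie_header
    return UPSTREAMS[colour], outgoing


upstream, headers = route_request({"Cookie": "session=abc; use_microservice_2_dark=true"})
print(upstream, headers)
```

Requests without the cookie keep flowing to green, so regular users are unaffected while a tester's browser session exercises the full path through blue.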

Has anyone come across this particular pattern before?

Thanks!





https://redd.it/1elke7d
@r_devops
TechWorld with Nana DevOps Bootcamp vs KodeKloud Bootcamp

Hey everyone!

I know this has been asked before, but I wanted to know if anyone has recent experience with the DevOps Bootcamp from TechWorld with Nana or the DevOps / SRE learning path from KodeKloud.

I’m fortunate to have a learning budget at my company, so I’m not looking for the cheapest option so much as the best fit in terms of learning material and practical experience. If anyone has other options or recommendations, I’m happy to hear those as well!

https://redd.it/1elm09m
@r_devops