Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
How CrowdStrike is improving their DevOps to prevent widespread outages

On July 19th, you may have been affected by the computer outage caused by CrowdStrike's update. What you may not know is which DevOps practices they weren't following when deploying it.

# Some background

Yesterday CrowdStrike posted an update giving a rundown of why exactly the outage happened and how they will improve their development and deployment processes to prevent such a catastrophic release again.

In their update, they deployed a configuration file that erroneously passed an automated validation step. When computers loaded this file, it triggered an out-of-bounds memory read that caused a recurring BSOD until someone with IT experience could fix the problem.

# Steps they are taking to deploy more effectively

Beyond their efforts to implement a [robust QA process](https://medium.com/@qacomet/what-we-can-learn-from-the-crowdstrike-outage-bc98c16b5426), they are also planning on following modern best DevOps practices for future deployments. Let's see how they are improving updates to production.

* **Staggered deployments**: Apparently, when they pushed configuration files to customers' systems, they weren't deploying them in a multi-staged manner. Because of the outage, they will now deploy all updates in stages: first a canary deployment, then a deployment to a small subset of users, and finally staged deployments across partitions of users. This way, if there's a broken update again, it will be contained to a small subset of users.
* **Enhanced monitoring and logging**: Another way they are improving their deployment process is by increasing the amount of logging and notifications. From what they said, this will include notifications during the various deployment stages, and each stage will be timed so they can tell when part of the process has failed.
* **Adding update controls**: Before this update, end users had few if any controls over CrowdStrike updates. The new controls let users on mission-critical systems, like airlines or hospitals, control when updates are applied, which gives them a layer of protection from being part of early updates.
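
To make the staged idea concrete, here is a minimal Python sketch of a rollout that stops at the first unhealthy stage. It is only an illustration, not CrowdStrike's actual process: the `Stage` class, the `staged_rollout` function, the `deploy` and `is_healthy` callbacks, and the host names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    """One rollout ring: a name plus the hosts it covers (hypothetical)."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    stages: Sequence[Stage],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_ratio: float = 0.0,
) -> list[str]:
    """Deploy stage by stage and stop at the first stage that looks unhealthy.

    Returns the names of the stages that completed successfully.
    """
    completed: list[str] = []
    for stage in stages:
        for host in stage.hosts:
            deploy(host)
        failures = sum(1 for host in stage.hosts if not is_healthy(host))
        if stage.hosts and failures / len(stage.hosts) > max_failure_ratio:
            # Halt here: later (larger) stages never receive the update.
            break
        completed.append(stage.name)
    return completed


if __name__ == "__main__":
    stages = [
        Stage("canary", ["canary-1"]),
        Stage("early", ["a-1", "a-2"]),
        Stage("broad", ["b-1", "b-2", "b-3"]),
    ]
    deployed: list[str] = []
    # Pretend host "a-2" breaks after the update.
    done = staged_rollout(stages, deployed.append, lambda h: h != "a-2")
    print(done)      # ['canary']
    print(deployed)  # ['canary-1', 'a-1', 'a-2']
```

In this toy run the "broad" stage is never touched, which is the whole point of staging: the broken update reaches three hosts instead of six.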



https://redd.it/1eeo8ps
@r_devops
Is there a CI service people actually like using?

Maybe one that isn't just a yaml configured script runner?

Or is there room here for something better that just hasn't been made yet?

https://redd.it/1eepsfw
@r_devops
monorepo for github actions

Hey, so I need to collect my GitHub Actions in one place for ease of development and versioning. I was wondering if there is a way to create a monorepo for this use case. What I am aiming for is to create GitHub Actions for multiple environments, version them, and release them on the GitHub Marketplace.

    gh-actions-monorepo/
    ├── .github/
    │   ├── workflows/some-way-to-release-on-marketplace
    ├── python/
    │   ├── python-action-1
    ├── node/
    │   ├── node-action-1
    ├── rust/
    │   ├── rust-action-1
    │   ├── rust-action-2
    ├── common/
    │   ├── common-action-1
    │   ├── common-action-1


Is there any tooling or monorepo setup for this kind of thing? For example, we have [turborepo](https://turbo.build/) for Node monorepos. Which environment would be best for this?
If anyone knows of an existing example and can link it, that would be really helpful.

https://redd.it/1eeq2vj
@r_devops
Centralized logging of containers on different VMs

Hi devops!

I'm searching for a proper way to centralize logging across multiple VMs. My current approach is to copy a Docker Compose file onto the VMs via Ansible. It includes a Promtail container which fetches the container logs and sends them to a single Loki instance, which can be queried from Grafana.


This is what my docker-compose.yml looks like:

    services:
      caddy:
        image: caddy
        restart: always
        ports:
          - "9080:9080"
          - "9081:9081"
        volumes:
          - ./Caddyfile:/etc/caddy/Caddyfile
          - ./certs:/certs
          - caddy_data:/data
          - caddy_config:/config

      cadvisor:
        image: gcr.io/cadvisor/cadvisor
        restart: always
        devices:
          - /dev/kmsg
        privileged: true
        volumes:
          - "/dev/disk/:/dev/disk:ro"
          - "/var/lib/docker/:/var/lib/docker:ro"
          - "/sys:/sys:ro"
          - "/var/run:/var/run:ro"
          - "/:/rootfs:ro"

      node_exporter:
        image: quay.io/prometheus/node-exporter:latest
        restart: always
        command:
          - "--path.rootfs=/host"
        pid: host
        volumes:
          - "/:/host:ro,rslave"

      promtail:
        image: grafana/promtail
        restart: always
        volumes:
          - /var/lib/docker/containers:/var/lib/docker/containers
          - /var/run/docker.sock:/var/run/docker.sock
          - ./promtail.yml:/etc/promtail/promtail.yml
        command: -config.file=/etc/promtail/promtail.yml
        labels:
          - "is-monitoring=true"

    volumes:
      caddy_data:
      caddy_config:


`cadvisor` and `node_exporter` are secured by basic\_auth and self-signed https.

Is there a better solution? How do you guys do this? All the VMs serve different applications with Docker Compose, also deployed with Ansible.

https://redd.it/1eestp0
@r_devops
Branch per environment viability?

Feels almost like posting a roast me to be asking this, we've been looking at different branching strategies and have landed on this, however every time I try to look up cicd processes and ways of working it feels like there's just a bombardment of trunk based being the only way.

There's a requirement from management to control releases to environments tightly (dev, qa, prod) and they don't want to utilise feature flags, so it came down to either deploying via tags or with a branch per env and it seemed easier to deploy hot fixes this way.

I was wondering whether anyone has success with this method, I'm not looking to implement trunk based so thank you but please don't suggest it as a fix, I'm more looking for anyone who's successfully working this way - or if you aren't, why not and why I shouldn't be, glaring issues that I'm perhaps missing.

I know it's a slower process however even a release per 2 weeks into production would be faster than the current and fast enough for ourselves, we'll be utilising a monorepo (backend, frontend, infra) but with a separate manifests repo for k8s config (this won't be a branch per env, just PR to main with kustomize overlays), thanks.

https://redd.it/1eewf86
@r_devops
Should I learn AWS or Azure in my current position?

Hi, I have been working at a mid-size software company based in Europe as a DevOps engineer for 2.5 years. It's my first job after graduation. My team is responsible for platform, hosting, monitoring and deployment. Our dev, test and pre-prod environments run on robust on-prem Proxmox clusters, and our prod environments are in Citrix. A small portion of our resources is in AWS.

One guy on our team looks after the AWS infra. My boss relies on him because he is the only one on my team who has some experience with AWS. He and our team lead attend all the meetings and workshops regarding AWS. They always discuss different clients and issues in AWS, and he gets a lot of privileges. I want to be a part of it.

I have never worked professionally with AWS. I took the AWS Solutions Architect and AZ-104 courses during my studies but didn't sit the exams. I think they are not getting me involved because they think I might not be competent enough, or for some other reason. So I am planning to improve my AWS skills: I plan to take a couple of advanced AWS courses or get some certifications to show my interest.

Recently, a new client proposed hosting their infra in Azure, but my boss and colleague are trying to convince the client to move to AWS, either because we don't have any Azure person on our team or because my company might not want to go multi-cloud.

Now I am confused about what I should do! Should I build skills in what we are currently using (though I'm not sure I'll get a chance to work with it), or in skills my team is lacking (also not sure if I'll ever get a chance to work with them)?

https://redd.it/1eez77u
@r_devops
What specific elements or strategies should a junior DevOps candidate include in their portfolio to effectively capture the attention of recruiters/AI and hiring managers?

I have 2 years of experience as a support engineer, and soon I'll want to move to a new role as a DevOps engineer. What kind of portfolio captures the attention of AI, recruiters, and hiring managers? I'll list some ideas (ones I learn from my hiring managers, if I get positive feedback).

https://redd.it/1ef0f6g
@r_devops
Interview Question

“How do you handle 20 tasks at once?”

I got this in an interview recently and it stumped me, mostly because I’m not sure anyone can effectively do that.

How would you respond?

https://redd.it/1ef130v
@r_devops
Devops Interview Prep

Hi Folks,

I'm trying to prepare for a DevOps interview and I have a question: how do you guys prepare for Terraform? I mean, obviously you don't create infra on a daily basis in your organization, so how do you keep up with new content and get yourself ready for Terraform-related interview questions? Are there any particular resources or topics you review before the interview?

https://redd.it/1ef4mhs
@r_devops
How do you conduct an interview for a Devops role?

I have to conduct an interview for a DevOps role that heavily involves AWS. I want to know from the community how you judge if someone is really good at doing DevOps stuff. What questions do you generally ask?

https://redd.it/1eezql1
@r_devops
Senior Engineer, but don't know how to code

I have 9 years of professional linux admin experience, I'm not junior.

I have a legitimately high ranking title and the respect of my peers, which I've earned through regular promotions at the company I've worked at for 5 years. I'm the senior-most member of my team of 11. Of that group, only 2 have a 'real' software engineering background. It's just not what we do for the most part.


My responsibilities for my entire career have been entirely oriented around infrastructure, devops and SRE. I've gotten quite good in my domain, and with light scripting. I'll do Ansible, Chef, and 'easy' bash and python. I know enough about Golang and have contributed useful fixes to Go projects. I can tell you basically everything about python and Go syntax after years of immersion and reading of code. I can discuss software engineering with real devs all day and convince you I know what I'm doing. I've just never meaningfully built anything myself and have no muscle memory for it.


With all of that said, I don't know how to code at all beyond \~300 line python scripts to do simple tasks. Frankly, I don't want to become a 'real' software engineer, but I consistently feel bad about not being able to contribute more. ChatGPT has actually helped me a ton with this and with getting better, but it's not enough.

==

A specific problem I'm facing right now is developing short-term tooling to manage bare-metal datacenter hardware. I have \~2000 servers running a stack that I own across many sites. They exhibit near constant failures for one reason or another, and I want to fix it. I've taken several stabs now at developing a controller that watches hardware, attempts to remediate through our PXE boot server, and eventually files Jira tickets for the DC team to inspect issues that cannot be fixed by just re-installing the server. This involves a fair bit of state management, including human input by DC techs and ensuring the servers are truly offline. What about partial failures? What about the many types of hardware failures? What about transient failures because the DC as a whole is having issues? I can't just nuke everything in that case. It really is a hard problem with high stakes.


What I have is an unreliable hodgepodge of scripts that are disparate and frankly don't work well. It's a surprisingly difficult problem since failure modes are myriad, humans are in the loop of remediation, the PXE booter is extremely unreliable and risk is high of making the situation worse.
I just don't know how to write code well enough to solve this problem. It is far beyond anything else I've made myself.

==

A lot of this is mental, I just have had zero formal education in computer science or software engineering and have gotten pretty far just figuring things out as I need to, which just often hasn't involved writing more complex pieces of software myself.

What holds me back here is knowing that any tool I make is going to be supported solely by me and will only be short-term, as there is a separate project to completely overhaul the way infrastructure is managed that will obsolete whatever I build. That overhaul will not be complete for another 1-2 years.

It's also just a lot of effort. I've been extremely stressed lately for work, personal and immigration reasons. While 9 years might not sound like a lot in the grand scheme of things, I'm close to early retirement and could do it in 3-4 years if nothing stupid happens at current rates. Frankly I'm tired of my career and just want to retire already. I have zero passion for software engineering.
I don't really know where I'm going with this, but I wanted to write it out. It sounds a lot like whining reading it back, but I burned the candle from both ends for 9 years to get where I am today.

Maybe I just move into management if my current boss leaves and never worry about it again.

tldr I'm facing new problems I don't know how to deal with, and am conflicted

\- I'm not a full time SWE and wasn't hired to be one
\- It's a lot of time and effort I just don't have anymore
\- Is it even worth the effort if I had it
\- Chance of failure is high
\- Chance of wasted effort is high
\- Sense of embarrassment about asking for help on something your average swe intern could do better than me. Don't want to waste the time of others.

​

I don't know how to learn to code something that's meaningfully complex involving state machines. This isn't just a crud api with a tutorial to follow.
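
One common way to tame this kind of problem is to write the state machine down as an explicit transition table and keep it free of I/O, so it can be tested without touching real hardware. The Python sketch below is only an illustration under assumptions: the states, events, and the 20% site-wide threshold are hypothetical and not taken from any real tooling.

```python
from enum import Enum, auto


class State(Enum):
    HEALTHY = auto()
    SUSPECT = auto()
    REINSTALLING = auto()
    NEEDS_TECH = auto()  # ticket filed, waiting on a DC tech


class Event(Enum):
    CHECK_FAILED = auto()
    CHECK_PASSED = auto()
    REINSTALL_FAILED = auto()
    TECH_DONE = auto()


# Every allowed (state, event) pair and the state it leads to.
# Anything not listed is rejected instead of silently guessed at.
TRANSITIONS = {
    (State.HEALTHY, Event.CHECK_FAILED): State.SUSPECT,
    (State.SUSPECT, Event.CHECK_PASSED): State.HEALTHY,  # transient blip
    (State.SUSPECT, Event.CHECK_FAILED): State.REINSTALLING,
    (State.REINSTALLING, Event.CHECK_PASSED): State.HEALTHY,
    (State.REINSTALLING, Event.REINSTALL_FAILED): State.NEEDS_TECH,
    (State.NEEDS_TECH, Event.TECH_DONE): State.SUSPECT,  # re-verify first
}


def next_state(state: State, event: Event) -> State:
    """Pure transition function: no I/O, so it is easy to unit test."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event.name} is not valid in {state.name}") from None


def site_is_degraded(states: list[State], threshold: float = 0.2) -> bool:
    """Return True when more than `threshold` of a site's servers are unhealthy.

    A caller can use this to pause automatic remediation during a site-wide issue.
    """
    if not states:
        return False
    unhealthy = sum(1 for s in states if s is not State.HEALTHY)
    return unhealthy / len(states) > threshold


if __name__ == "__main__":
    state = State.HEALTHY
    for event in (Event.CHECK_FAILED, Event.CHECK_FAILED, Event.REINSTALL_FAILED):
        state = next_state(state, event)
    print(state.name)  # NEEDS_TECH
```

The unreliable parts (PXE calls, Jira, health checks) would live in a separate loop that only feeds events into `next_state` and persists the result, so a crash or a flaky reinstall never leaves a server in an undefined state.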

https://redd.it/1ef82hq
@r_devops
How I switched to devops after 9 years working as a Linux support engineer

2.5 years ago I was 31 years old and had successfully wasted nine years of my career stuck at IBM doing mediocre work for a significantly lower salary.

I had come from an NIT and was stuck in this level of work while my college mates climbed the success ladders. Some worked in the USA/UK, some went onsite, and others enjoyed senior-level positions at big companies and all the shiny glamour of a successful career.

I was working in a platform support role, which people looked at with pity.

I was working the night shift and providing on-call support on weekends. I had no work-life balance, and my health was getting worse due to lack of sleep.

I was stuck in a horrible comfort zone, scared of the change. Imposter syndrome and a severe lack of self-worth were constant companions, and I had zero confidence in myself.

To make matters worse, I got married. I was under a lot of pressure financially and started getting panic attacks due to the fear of getting laid off, as I lacked the skills to do anything other than support work.

After many sleepless nights, I realized something.

If you change nothing, nothing will change.

I decided to make a career switch to devops, as it was related to the work I had been doing for years as a Linux and AIX support engineer.

I started researching online about the devops roadmap, and it was no help as all the posts talked about learning a plethora of tools, and learning all of them felt impossible.

So I turned to YouTube to find better guidance for devops and stumbled upon a channel, Techworld with Nana. It was good and gave me some confidence.

I decided to focus on the essential devops tools and master them.

* One cloud platform: I chose AWS
* One infrastructure as code tool: Terraform
* Linux and Docker
* Version control tools: Git and GitHub
* One CI/CD tool: GitHub Actions
* Scripting: Python

I started deep-diving into the above topics by watching YouTube videos and reading Medium blogs on all of them.

I followed the resources and did a lot of hands-on with these tools. I also went through AWS and Terraform documentation.

After one month of hard work, I started getting some confidence.

I realized that I needed to get some real-world working experience. I spoke to a few of my friends who worked as devops engineers and asked them about their day-to-day work and the kinds of tasks they do.

As per their suggestions, I created multiple projects to practice.

* Deploy a 3-tier architecture on AWS with Terraform.
* Deploy a sample Flask project to EC2 instances using Docker and GitHub Actions.
* Deploy a Lambda function to send weekly reports.
* Manage S3 buckets with CLI commands.
* Deploy a Flask API in AWS ECS with Terraform.

They also suggested I learn Kubernetes.

I spent another month doing hands-on labs and learning Kubernetes along the way.

By the end of 2 months, I was confident to start giving interviews. I did some research and updated my resume.

I wanted to make my resume stand out, so I used Canva for predesigned resume templates and built a professional-looking resume.

I also understood that I cannot switch to devops without showing any relevant experience.

I added 2.5 years of devops experience and curated the devops experience using my friend's resumes.

I updated my LinkedIn and Naukri profiles. After one week, I started getting a lot of calls for various roles around devops.

I crapped my pants in the first few interviews, as they asked questions that only an experienced devops engineer could answer.

I did not let it discourage me, as I knew it would happen. I used the questions the interviewers asked and prepared for the topics around them in depth.

After three/four interviews, I started getting better.

Shortly, I received two offer letters from relatively small companies.

I continued giving more interviews and got three more offer letters from reputed companies. I used these offer letters to negotiate a good package (2.5x my current CTC).

I resigned from IBM and joined after serving the notice period.

It changed my life completely. I had everything.

* A handsome salary with a bunch of great benefits.
* Respected designation at a reputed company.
* Day working hours, free weekends, and work-life balance.
* Confidence, self-worth, and motivation to do more.

I am inspired to take my career to another level.

https://redd.it/1efa2vk
@r_devops
You think you know pain?

Pain is importing 30+ AWS accounts into Terraform from scratch, accounts that have been click-opsed to hell for years.

Better late than never

https://redd.it/1efao9i
@r_devops
Has anyone heard of blackbird for API development?

I like to watch some of the videos TFiR puts out because sometimes they uncover early-days or lesser-known tools that end up being pretty cool. I saw one they put out a few days ago on a tool called Blackbird for API development. I did some digging, and it looks like it's still in beta (but free once you sign up). Has anyone here played around with it yet for API spec generation or testing?

Here's the video I was referencing:
https://youtu.be/hb_V54E0B78

https://redd.it/1efbr8f
@r_devops
Platform Engineering roles advertised as DevOps engineering?

I'm currently on the job market due to being made redundant (my team was moved to a cheaper employment market), and I've noticed that a huge portion of the "DevOps Engineer" roles are more like platform engineering roles. They're basically "manage this public cloud infra" and have nothing to do with CI/CD, let alone the DevOps lifecycle.

My current (well, former now I guess...) role had me managing CI/CD platforms, implementing best practices around testing, binary controls and security scanning, making sure there were good feedback loops into Jira from testing, and doing a little bit of system admin to keep legacy build infrastructure ticking along.

As a result, I'm actually not cloud orientated. I just wasn't trained in it, as that was the cloud ops team's area. I worked with them plenty, but it wasn't my area. We were meant to get some formal training so we could help more or have better input, but it was delayed and delayed. I suspect my team had already been put forward for the chopping block, so they never bothered.

Sure, I can agree that platform engineering can be part of the DevOps lifecycle, if it's regarding deployments, but that's different to "Manage this cloud infra from the bottom up".

https://redd.it/1efa843
@r_devops
AWS vs GCP vs Azure

I was wondering which cloud platform to start learning with, for example from the perspective of ease of use, handling complexity, etc.

https://redd.it/1effwzg
@r_devops
“We just signed a 12 month contract for this service so you have to use it”

Why don’t product teams and leadership ever conceptualize the sunk cost fallacy and that it will be ever harder to move away from a service once you invest more and more engineering hours into it?

https://redd.it/1efi2y4
@r_devops
Best Practice for Docker and Kubernetes Deployment

What's the best practice for creating pipelines from GitHub to Docker registries to cloud Kubernetes clusters?

For example, should pushing to the GitHub repo trigger a Docker image build and push?

https://redd.it/1efi25o
@r_devops