Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Joining SRE Role in a Top Fintech Company: Is It Really Worth It?

I’m excited to share that I’m joining as an SRE (Site Reliability Engineer), even though my initial goal was to become a developer. Unfortunately, there weren't any available developer roles at the moment. I'll be working with OpenShift and Unix technologies.

I’m a bit concerned about my career progression with these technologies. Does anyone have experience with this and can share their thoughts on the career path for SREs? Also, are these technologies interesting to work with? And is it possible to transition to a developer role in the future?

Thanks for any advice!

https://redd.it/1edcw39
@r_devops
I built an open-source tool to make on-call suck less

Hey y'all,

TL;DR

I am building an [open source platform](https://github.com/opslane/opslane) to make on-call better and less stressful for engineers. We are building a tool that can silence alerts and help with debugging and root cause analysis. We also want to automate the tedious parts of being on-call (running runbooks manually, answering questions on Slack, dealing with PagerDuty).

Here is a quick video of how it works: https://youtu.be/m_K9Dq1kZDw

I hated being on-call for a couple of reasons:

**- Alert volume**: The number of alerts kept increasing over time, and it was hard to maintain the existing ones. This led to a lot of noisy, unactionable alerts. I have lost count of the number of times I got woken up by an alert that auto-resolved 5 minutes later.

**- Debugging**: Debugging an alert or a customer support ticket meant gaining context on a service that I might not have worked on before. The companies I worked at used many observability tools, which made debugging harder. There is always time pressure to resolve issues quickly.

There were some more tangential issues that used to take up a lot of on-call time:

**- Support**: Answering questions from other teams. A lot of the time these questions were repetitive and had been answered before.

**- Dealing with PagerDuty**: These tools are hard to use. For example, it was hard to schedule an override in PagerDuty or set up holiday schedules.

I am building an on-call tool that is Slack-native, since Slack has become the de facto tool for on-call engineers.

To start off, Opslane integrates with Datadog and can classify alerts as actionable or noisy.

We analyze your alert history across various signals:

* Alert frequency
* How quickly the alerts have resolved in the past
* Alert priority
* Alert response history

Our classification is conservative and it can be tuned as teams get more confidence in the predictions. We want to make sure that you aren't accidentally missing a critical alert.
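To make the "conservative" idea concrete, here is a minimal Python sketch of how the four signals above could be combined. It is not Opslane's actual implementation: the `AlertStats` fields, the thresholds, and the `noisy_threshold` knob are all assumptions made up for illustration. An alert is only marked noisy when several independent signals agree, and top-priority alerts are never silenced.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Hypothetical per-alert history summary; field names are illustrative."""
    fires_per_week: float             # alert frequency
    median_minutes_to_resolve: float  # how quickly it resolved in the past
    priority: int                     # 1 = highest priority
    acknowledged_ratio: float         # share of past firings a human responded to


def classify(stats: AlertStats, noisy_threshold: int = 3) -> str:
    """Return "noisy" only when enough independent signals agree."""
    if stats.priority == 1:
        return "actionable"  # never silence top-priority alerts

    noisy_signals = 0
    if stats.fires_per_week >= 20:             # fires very often (assumed cutoff)
        noisy_signals += 1
    if stats.median_minutes_to_resolve <= 5:   # usually auto-resolves quickly
        noisy_signals += 1
    if stats.acknowledged_ratio <= 0.1:        # people rarely respond to it
        noisy_signals += 1

    return "noisy" if noisy_signals >= noisy_threshold else "actionable"
```

Lowering `noisy_threshold` from 3 to 2 is one way a team could tune the classifier to be more aggressive once it trusts the predictions.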

Additionally, we generate a weekly report based on all your alerts to give you a picture of your overall alert hygiene.

What’s next?

* Building more integrations (Prometheus, Splunk, Sentry, PagerDuty) to continue making on-call quality of life better
* Making debugging and root cause analysis easier
* Runbook automation

We’re still pretty early in development and we want to make on-call quality of life better. Any feedback would be much appreciated!

https://redd.it/1edhedn
@r_devops
Hopium Looks like the market is coming back for mid-level engineers and seniors!!

Noticing tons of job postings, more recruiter DMs and a lot of anecdotal experiences of my friends job hopping to double their TC.

It's still not where it should be, but damn boiz... brings a tear to my eye.. we are slowly getting back there!!!

Even seeing some SDE 1 positions at a few FAANGs now for entry level folks

Keep on hustling. We're all going to make it.

https://redd.it/1edirof
@r_devops
Am I out of touch? (interview)

I had my first Coderbyte challenge, and it gave me 3 mediums and 1 hard to solve in 5 hours.

I also had long response questions like:

What is Docker? Kubernetes?

Which of these is not a service?
ALB, ELB, NLB, SWE

What command would you run to see pods running in kubernetes namespace main?

At what point are 4 LeetCode problems necessary? Surely 2 would provide enough information to decide whether I should move to the next round.

Further, why am I asked 3 medium and 1 hard LeetCode questions, and then joke questions for anything related to DevOps/platform?

And no, I didn’t even attempt this, because I’m fortunately happily employed.

https://redd.it/1ednhzh
@r_devops
Roadmap Devops

Hi guys, what are the best resources for studying DevOps?

Is it possible to learn it all through self-study?

https://redd.it/1edwebl
@r_devops
Docker course

Hello Docker champs 🏆,

If you had to choose just one resource to learn Docker online, what's your top choice? Or to put it another way, whose videos or documentation did you follow to get where you are today, along with hands-on practice?

https://redd.it/1edxfht
@r_devops
Windows servers in a devops environment

I'm working very hard to create a DevOps culture around our dev workflows on Linux, but we also have a largely manual Windows environment that needs to be dealt with.

We don't currently have a good tool to manage Windows servers, and I'm debating whether we should try to use Ansible (or Puppet), or whether that would be too weird and non-standard and we should find something Windows-specific instead.

https://redd.it/1edyqsk
@r_devops
Question regarding DevSecOps from Application Security

I have been working as an application security engineer for the past 3 years, with 2 years of VAPT before that. I am now looking to properly add DevSecOps to my skills.
I have experience with Azure, Docker, and security scanning tools.
What other tools and technologies should I focus on besides Kubernetes?
Should I also learn Jenkins for better job prospects, despite already knowing Azure DevOps and GitHub Actions?
Also, what certifications should I go for other than Azure Security Professional? Should I get similar certificates for AWS or GCP?

Thanks.

https://redd.it/1ee4hfe
@r_devops
Need DevOps Freelance Job

Hi guys,

I am a DevOps engineer with 4+ years of experience. I have worked in Azure and AWS cloud and with almost all of the common tools. I am looking for a freelance opportunity. Please let me know if anyone wants support or help, or is hiring. I am ready to work in your time zone. Message me and I will share a brief resume with my major accomplishments.

https://redd.it/1ee6lur
@r_devops
Runbook automation(execute script) vs lambda

So I am setting up an EventBridge rule that executes a script in response to an event.
I have 3 choices:
1) A Lambda with my own bash script
2) A Lambda with a Python script
3) The execute script action of runbook automation (Python script)

Which is the better way to go, and why would you choose it? Also, does it really make a difference, since all of them are serverless?
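For reference, option 2 could look roughly like the sketch below. It assumes an EC2 state-change event purely as an example; `run_remediation` is a hypothetical stand-in for whatever the script needs to do. One practical difference between options 1 and 2 is that Lambda ships a managed Python runtime, while a bash script needs a custom runtime.

```python
import json


def run_remediation(detail: dict) -> dict:
    """Hypothetical stand-in for the script logic you want to run."""
    return {"handled": True, "resource": detail.get("instance-id", "unknown")}


def lambda_handler(event, context):
    """Entry point Lambda invokes; EventBridge delivers the event as a dict."""
    # EventBridge puts the event-specific payload under the "detail" key.
    detail = event.get("detail", {})
    result = run_remediation(detail)
    # Print one JSON line so the outcome is easy to find in the function logs.
    print(json.dumps({"source": event.get("source"), "result": result}))
    return result
```

Keeping the real work in a plain function like `run_remediation` also means the same code can be tested locally without deploying anything.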

https://redd.it/1ee8l41
@r_devops
How to deploy Azure ML batch endpoint from docker image?

Hi, I have my own deep learning task that requires 2-3 different ML models. I built the code and containerized it, i.e. the Python env and code are in the Docker image.

I am running FastAPI servers inside Docker to run the code.

I deployed it to an AWS SageMaker async endpoint and it is working fine.

Now I need to deploy it to an Azure ML batch endpoint, but there's no documentation as such for deploying it with a custom Docker container.

Can someone help me?

https://redd.it/1eeab4t
@r_devops
[Helm, Traefik, Nginx]: Application Routing results in 404 :(

Hello, my fellow humans,
I'm currently facing a small issue where I'm kind of stuck.

I'm working on a React application built with Vite, using React Router DOM for routing.

For deployment, Kubernetes, Helm, and Traefik are used.

The application originally had only the '/' and '/base' routes.

The application now requires more routes to cover the desired features. Thus, I have implemented the following routes in my React application:
- Route Root: '/' // <- This redirects to /base
- Route Base: '/base' // <- This shows a landing page.
- Route Sub1: '/base/A' // <- This shows page 1.
- Route Sub2: '/base/B' // <- This shows page 2.

Locally everything works out of the box.


## The Problem:

Upon deployment:
- Navigating through the routes using the application buttons works as expected.
- Manually navigating to the Base or Root route shows the application landing page correctly.
- The problem arises when manually navigating to either subroute: nginx returns a 404.

Here are only the relevant code sections from the relevant files:


## The Code:

### `values.yaml`

```
frontend:
  replicaCount: 3
  images:
    repository: //internal repo name
    tag: latest
    pullPolicy: Always
  port: 8080
  targetPort: 8080
  healthPort: 8080
  urlPrefix:
    - /{base:(base(/.*|/\.+.*)?$)}
  trimPrefix:
    - /base
  errorUrls:
    - /401.html
    - /404.html
    - /50x.html
```

### `frontendingress.yaml`

```
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: application-frontend
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web, websecure
    traefik.ingress.kubernetes.io/router.priority: "10"
    traefik.ingress.kubernetes.io/router.middlewares: {{ if .Values.tls.enabled }}redirect-to-https@file,{{- end }} auth@file, {{.Release.Namespace}}-strip-frontend@kubernetescrd
    {{ if .Values.tls.enabled -}}
    traefik.ingress.kubernetes.io/router.tls: "true"
    {{- end }}
spec:
  ingressClassName: {{.Values.ingress.class}}
  rules:
    - host: {{.Values.ingress.host}}
      http:
        paths:
          - path: {{ index .Values.frontend.urlPrefix 0 }}
            pathType: Exact
            backend:
              service:
                name: application-frontend-svc
                port:
                  number: {{.Values.frontend.jwtProxy.port}}
  {{ if .Values.tls.enabled -}}
  tls:
    - hosts:
        - {{.Values.ingress.host}}
      secretName: {{.Values.tls.secretName}}
  {{- end }}

```


### `frontendmiddleware.yaml`

```
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: strip-frontend
spec:
  stripPrefix:
    prefixes:
      - {{ index .Values.frontend.trimPrefix 0 }}
```


### `nginx.conf` in the project folder nginx/:

Along with `404.html, 401.html, 50x.html`

```
map $http_user_agent $loggable {
    ~^kube-probe 0;
    default 1;
}

server {
    server_tokens off;

    listen 8080;

    absolute_redirect off;

    location "/" {
        autoindex off;
        root /usr/share/nginx/html;
        index index.html index.htm;
        try_files $uri $uri/ =404;
        add_header Cache-Control "no-store, no-cache, must-revalidate";
    }

    error_page 404 /404.html;
    error_page 500 502 503 504 /50x.html;

    location = /50x.html {
        root /usr/share/nginx/html;
    }
    location = /404.html {
        root /usr/share/nginx/html;
    }

    access_log /var/log/nginx/access.log main if=$loggable;
}

```
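One plausible reading of the 404, sketched in Python below: the `try_files $uri $uri/ =404;` line in the config above only serves files that exist on disk, and after the `stripPrefix` middleware a request for `/base/A` arrives at nginx as `/A`, which is a client-side route rather than a file. The sketch is a simplified model, not nginx itself, and the file names in `FILES` are assumptions about the build output. A common fix for single-page apps is to fall back to `/index.html` instead of `=404`, so the router can take over in the browser.

```python
def try_files(uri: str, files: set[str], fallback: str) -> str:
    """Simplified model of nginx try_files: return what would be served."""
    if uri in files:                       # $uri matches a file
        return uri
    index = uri.rstrip("/") + "/index.html"
    if index in files:                     # $uri/ matches a directory with an index
        return index
    if fallback.startswith("="):           # e.g. "=404" returns that status code
        return fallback[1:]
    return fallback                        # otherwise internal redirect to a URI


def strip_prefix(path: str, prefix: str = "/base") -> str:
    """Mimic the Traefik stripPrefix middleware from the post."""
    rest = path[len(prefix):] if path.startswith(prefix) else path
    return rest if rest.startswith("/") else "/" + rest


# Assumed contents of /usr/share/nginx/html (illustrative file names).
FILES = {"/index.html", "/assets/app.js"}

print(try_files(strip_prefix("/base"), FILES, "=404"))           # landing page works
print(try_files(strip_prefix("/base/A"), FILES, "=404"))         # subroute fails
print(try_files(strip_prefix("/base/A"), FILES, "/index.html"))  # SPA fallback
```

In-app navigation keeps working in the broken setup because the browser never asks nginx for `/base/A`; only a manual page load does.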

In my frontend, I've implemented the routes as follows:

### `Routes.ts`

```
export const AppRoutes = () => {
  const hasImageEntitlement = useStore((state) => state.hasImageGenEntitlement);

  return [
    { path: Constants.AppRoutes.ROOT_PATH, element: <Navigate to={Constants.AppRoutes.BASE_PATH} /> },
    {
      path: Constants.AppRoutes.BASE_PATH,
      element: <AppLayout />,
      children: [
        { path: Constants.AppRoutes.GPT4TURBO_PATH, element: <AppLayout /> },
        {
          path: Constants.AppRoutes.DALLE3_PATH,
          element: hasImageEntitlement ? <AppLayout /> : <Navigate to={Constants.AppRoutes.BASE_PATH} />,
        },
      ],
    },
    { path: '*', element: <h1>The route doesnt exist show 404 after resolving the 404 subroute problem</h1> },
  ];
};

```


### `App.tsx`:

```
const appRouter = createBrowserRouter(
  createRoutesFromElements(
    <>
      {appRoutes.map((route) => (
        <Route key={route.path} path={route.path} element={route.element}>
          {route.children?.map((child) => (
            <Route key={child.path} path={child.path} element={child.element} />
          ))}
        </Route>
      ))}
    </>,
  ),
  {
    basename: `${import.meta.env.VITE_BASE_PATH}`,
    future: {
      v7_normalizeFormMethod: true,
      v7_relativeSplatPath: true,
      v7_fetcherPersist: true,
    },
  },
);
return (
  <RouterProvider
    router={appRouter}
    future={{
      v7_startTransition: true,
    }}
  />
);
```




I'm a DevOps noob, and the guy who set the whole thing up is not around anymore, so I'm on my own in this matter. I'm trying to learn as much as I can, so sorry if I'm a bit too stupid to see the solution :/

I very much appreciate your help and hope you all have a great day, or at least a better one than mine. :)

Thanks in advance.

https://redd.it/1eee4br
@r_devops
Tomcat server

I'm having a lot of difficulty understanding the deployment part of the Tomcat server. I'm a newbie to DevOps, Tomcat was part of my course, and I don't have a lot of technical computer knowledge. I don't understand what context path means here. Also, my teacher is so lazy that they never answer, and when they do, it takes weeks, and I want my concern solved ASAP.

My teacher just randomly entered /abc and it worked for him. So my question is: can the context path be anything in the world? He then put his .war in the tmp directory while sitting in root on his Linux system, went to the WAR path field, entered /tmp/warfilename.war, and it worked for him. It didn't work for me. What is happening, and why? I couldn't find any tutorials on this either, so if anyone can point me to a nice tutorial on YouTube, that would be very helpful.

Reddit is not letting me post a picture here, I don't know why. If somebody wants to help me, please DM me or comment below and I will DM you 😭

https://redd.it/1eeet14
@r_devops
deploying artifacts with msdeploy.exe

Hi all, we used to have pipelines that would build and deploy at the same time: we used msbuild with deploy-on-build, which would build and deploy to IIS. Now we build and store the artifacts in Azure Blob Storage. See the old example command below:

msbuild.exe project.proj -t:Restore /m /t:Build /t:Clean /p:Configuration=Release /p:EnvironmentName=Prod /p:RunAnalyzers=false /p:DeployOnBuild=True /p:WebPublishMethod=MSDeploy /p:MSDeployPublishMethod=WMSVC /p:AllowUntrustedCertificate=True /p:CreatePackageOnPublish=true /p:MSDeployServiceUrl=$serverDest /p:SkipInvalidConfigurations=true /p:DeployIisAppPath="mainsite/web" /p:UserName=$uname /p:Password=$pass /p:SkipExtraFilesOnServer=True /p:AssemblyVersion=$gitTag /p:nodeReuse=false /p:FileVersion=$gitTag


Now that we have the zipped artifact, I am trying to use msdeploy.exe (Web Deploy 3.6) to deploy to the remote server, but the msdeploy documentation is not great. I want to use the same options as with msbuild, but they do not translate to msdeploy. This is what I have:

msdeploy.exe -verb:sync -source:package=azFileName.zip -allowUntrusted -dest:auto,ComputerName=$serverDest,UserName=$uname,Password=$pass,AuthType=Basic -enableRule:DoNotDeleteRule -skip:Directory="/App_Data" -setParam:name="IIS Web Application Name",value="mainsite/web"


Is there a way to use msbuild.exe to deploy an artifact with a --no-build option or something similar?







https://redd.it/1een66c
@r_devops
How CrowdStrike is improving their DevOps to prevent widespread outages

On July 19th, you may have been affected by the computer outage caused by CrowdStrike's update. What you may not know is which DevOps practices they weren't following when they deployed that update.

# Some background

Yesterday CrowdStrike posted an update giving a rundown of exactly why the outage happened and how they will improve their development and deployment processes to prevent such a catastrophic release from happening again.

In short, they deployed a configuration file that erroneously passed an automated validation step. When computers loaded this update, it triggered an out-of-bounds memory error that caused a semi-permanent BSOD, until someone with IT experience could fix the problem.

# Steps they are taking to deploy more effectively

Beyond their efforts to implement a [robust QA process](https://medium.com/@qacomet/what-we-can-learn-from-the-crowdstrike-outage-bc98c16b5426), they are also planning to follow modern DevOps best practices for future deployments. Let's see how they are improving updates to production.

* **Staggered deployments**: Apparently, when they updated their configuration files across customers' systems, they weren't deploying them in a multi-staged manner. Because of the outage, they will now deploy all updates by starting with a canary deployment, then deploying to a small subset of users, and finally staging deployments across partitions of users. This way, if there's a broken update again, it will be contained to only a small subset of users.
* **Enhanced monitoring and logging**: Another way they are improving their deployment process is by increasing the amount of logging and notifications. From what they said, this will include notifications during the various deployment stages, and each stage will be timed so they can detect when a part of the process has failed.
* **Adding update controls**: Before this update, end users had few if any controls over CrowdStrike updates. CrowdStrike now plans to let users on mission-critical systems, like airlines or hospitals, control when updates are applied. This gives these users a blanket of protection from being part of early updates.
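
The staggered deployment idea can be sketched in a few lines of Python. This is a generic illustration of a staged rollout, not CrowdStrike's actual pipeline: the `deploy` and `healthy` callables and the stage names in the usage note are hypothetical placeholders you would supply yourself.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[tuple[str, Sequence[str]]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy stage by stage; stop at the first stage with an unhealthy host.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, hosts in stages:
        for host in hosts:
            deploy(host)
        if not all(healthy(host) for host in hosts):
            break  # contain the blast radius: later stages never get the update
        completed.append(name)
    return completed
```

With stages such as `[("canary", [...]), ("early", [...]), ("rest", [...])]`, a bad update that breaks a host in the "early" stage never reaches the hosts in "rest".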



https://redd.it/1eeo8ps
@r_devops
Is there a CI service people actually like using?

Maybe one that isn't just a yaml configured script runner?

Or is there room here for something better that just hasn't been made yet?

https://redd.it/1eepsfw
@r_devops
monorepo for github actions

Hey, so I need to gather my GitHub Actions in one place for ease of development and versioning. I was wondering if there is a way to create a monorepo for such a use case. What I am aiming for is to create GitHub Actions for multiple environments, version them, and release them on the GitHub Marketplace.

gh-actions-monorepo/
├── .github/
│ ├── workflows/some-way-to-release-on-marketplace
├── python/
│ ├── python-action-1
├── node/
│ ├── node-action-1
├── rust/
│ ├── rust-action-1
│ ├── rust-action-2
├── common/
│ ├── common-action-1
│ ├── common-action-1


Is there any tooling or monorepo setup for this kind of thing? For example, we have [turborepo](https://turbo.build/) for Node monorepos. Which environment would be best for this?
If anyone knows of an existing example and can link it, that would be really helpful.

https://redd.it/1eeq2vj
@r_devops
Centralized logging of containers on different VMs

Hi devops!

I'm searching for a proper solution to centralize logging across multiple VMs. My current approach is to copy a Docker Compose file via Ansible onto the VMs, with a Promtail container that fetches the container logs and sends them to a single Loki instance, which can be queried from Grafana.


This is what my docker-compose.yml looks like:

    services:
      caddy:
        image: caddy
        restart: always
        ports:
          - "9080:9080"
          - "9081:9081"
        volumes:
          - ./Caddyfile:/etc/caddy/Caddyfile
          - ./certs:/certs
          - caddy_data:/data
          - caddy_config:/config

      cadvisor:
        image: gcr.io/cadvisor/cadvisor
        restart: always
        devices:
          - /dev/kmsg
        privileged: true
        volumes:
          - "/dev/disk/:/dev/disk:ro"
          - "/var/lib/docker/:/var/lib/docker:ro"
          - "/sys:/sys:ro"
          - "/var/run:/var/run:ro"
          - "/:/rootfs:ro"

      node_exporter:
        image: quay.io/prometheus/node-exporter:latest
        restart: always
        command:
          - "--path.rootfs=/host"
        pid: host
        volumes:
          - "/:/host:ro,rslave"

      promtail:
        image: grafana/promtail
        restart: always
        volumes:
          - /var/lib/docker/containers:/var/lib/docker/containers
          - /var/run/docker.sock:/var/run/docker.sock
          - ./promtail.yml:/etc/promtail/promtail.yml
        command: -config.file=/etc/promtail/promtail.yml
        labels:
          - "is-monitoring=true"

    volumes:
      caddy_data:
      caddy_config:


`cadvisor` and `node_exporter` are secured by basic_auth and self-signed HTTPS.

Is there a better solution? How do you guys do this? All the VMs serve different applications with Docker Compose, also deployed with Ansible.

https://redd.it/1eestp0
@r_devops
Branch per environment viability?

It feels almost like posting a "roast me" to ask this. We've been looking at different branching strategies and have landed on this one. However, every time I look up CI/CD processes and ways of working, there's a bombardment of trunk-based development being the only way.

There's a requirement from management to tightly control releases to environments (dev, qa, prod), and they don't want to use feature flags. So it came down to either deploying via tags or using a branch per env, and the latter seemed easier for deploying hotfixes.

I was wondering whether anyone has had success with this method. I'm not looking to implement trunk-based development, so thank you, but please don't suggest it as a fix. I'm looking for anyone who's successfully working this way, or, if you aren't, why not, why I shouldn't be, and any glaring issues that I'm perhaps missing.

I know it's a slower process. However, even a release every 2 weeks into production would be faster than what we have now and fast enough for us. We'll be using a monorepo (backend, frontend, infra), with a separate manifests repo for k8s config (this won't be branch per env, just PRs to main with kustomize overlays). Thanks.
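
For anyone picturing what branch per environment means in a pipeline, here is a minimal Python sketch of the two rules it usually boils down to: which branch deploys where, and which merges count as a valid promotion. The branch names and the strict dev -> qa -> prod order are assumptions for illustration, and a hotfix flow would need an explicit exception to the promotion rule.

```python
from typing import Optional

# Hypothetical branch names; promotion order is dev -> qa -> main (prod).
BRANCH_TO_ENV = {"dev": "dev", "qa": "qa", "main": "prod"}
PROMOTION_ORDER = ["dev", "qa", "main"]


def target_environment(branch: str) -> Optional[str]:
    """Return the environment a push to this branch deploys to, if any."""
    return BRANCH_TO_ENV.get(branch)


def is_valid_promotion(source: str, target: str) -> bool:
    """Allow merges only into the next branch in the promotion order."""
    if source not in PROMOTION_ORDER or target not in PROMOTION_ORDER:
        return False
    return PROMOTION_ORDER.index(target) - PROMOTION_ORDER.index(source) == 1
```

A CI job could call `is_valid_promotion` on a pull request's source and target branches and fail the check when someone tries to skip qa.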

https://redd.it/1eewf86
@r_devops