Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Monitoring and Alerting

Hello,

I'm looking for best practices for monitoring and alerting in the DevOps world.

I currently work as a NOC Operator/Analyst for a company that specializes in retail. I mainly oversee all servers (on-prem & cloud), UPS, switches and ISP circuits at our corporate office. I have alerts in place for any issues that occur, along with triage/troubleshooting steps, and I create RCAs after each issue is resolved.

Recently, I was asked to help our IT Ecom/Dev team revise their current monitoring and alerting system. Within a month, I was kinda brought up to speed, considering I have no DevOps experience. So far, I've been given a list of the alerts we have in place and where they occur on the data flow map, plus basic knowledge of how all services are linked to our middleware. I also know the order flow and the "Life of a SKU".

I've already noticed some gaps and areas in need of improvement. Here is a small list of what I plan to work on. Please let me know if I'm on the right track and if there's anything else I should look into.

Missing descriptions in the alerts
Currently, some of the alerts just say "Failed to do XYZ" without explaining why, so the dev team and I have to spend time looking through logs to find out what went wrong. I believe we can add logic that maps specific error codes to a root cause.
Missing triage/troubleshooting
When an alert comes in, there are no instructions on what to do, such as who to notify or what to check. I believe we need a process in place for each alert, especially SEV1s. It could include a status check on services and cloud instances and a check of the logs for errors, followed by escalation to the correct team depending on the issue.
Run a weekly report on common alerts
Review the alerts that come in frequently to investigate trends and see if we can automate the resolution or set up preventive action so the error no longer occurs.
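
The first and third items above could be prototyped along these lines. This is only a minimal sketch: the error codes, runbook entries, team names and the `Alert` shape are all hypothetical placeholders, not anything taken from the actual alerting setup.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical error codes and runbook entries; real values would come from
# the team's own services and escalation policy.
RUNBOOK = {
    "INVENTORY_SYNC_TIMEOUT": {
        "cause": "Middleware timed out waiting for the WMS",
        "steps": ["Check WMS dashboard status", "Check Pub/Sub backlog"],
        "escalate_to": "integration-team",
    },
    "ORDER_EXPORT_AUTH": {
        "cause": "Credentials rejected by the OMS",
        "steps": ["Verify API key expiry", "Check OMS dashboard"],
        "escalate_to": "ecom-dev",
    },
}


@dataclass
class Alert:
    name: str
    error_code: str
    severity: str


def enrich(alert: Alert) -> str:
    """Turn a bare alert into a message with likely cause, triage steps and owner."""
    entry = RUNBOOK.get(alert.error_code)
    if entry is None:
        # No runbook entry yet: say so explicitly instead of guessing a cause.
        return (f"[{alert.severity}] {alert.name}: "
                f"unknown error code {alert.error_code}; check logs")
    steps = "; ".join(entry["steps"])
    return (f"[{alert.severity}] {alert.name}: {entry['cause']}. "
            f"Triage: {steps}. Escalate to: {entry['escalate_to']}")


def weekly_top_alerts(alerts, n=3):
    """Return the n most frequent alert names with their counts."""
    return Counter(a.name for a in alerts).most_common(n)
```

Keeping the runbook as plain data means the same table can feed both the alert text and the weekly review of which alerts still lack an entry.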

Here are the tools I have access to:

Grafana
GCP (Pub/Sub) (Logs)
OpsGenie Alerts
OMS dashboard
WMS dashboard
Shopify

I was also wondering if there's a role that specializes in monitoring and alerts + investigations in the devops environment. I've been enjoying learning a lot about pipelines/workflows and would love to be that support system for our team so the Devs can focus on their sprints and less on the issues that randomly come up.

https://redd.it/1divjvh
@r_devops
OpenShift project help

Hi y’all, I’m working on an OpenShift project for school. Basically, the task is to build a 3-tier architecture with 3 containers: one for DB2, one for the app itself and one for the API.

I’m not sure how to start. I’ve already written the code for the app, but I’m still stuck on the communication between DB2 and the API.

Any ideas would be greatly appreciated.

I can provide a link to the Git repo if needed.
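
One common way to wire the API container to the DB2 container is to expose DB2 through a Service and have the API read the connection details from environment variables. The sketch below only builds the connection string. The variable names and the `db2` Service name are assumptions for illustration, and the string follows the keyword format used by IBM's DB2 drivers.

```python
import os


def db2_connection_string(env=os.environ) -> str:
    """Build a DB2 connection string from environment variables.

    DB2_HOST defaults to "db2", a hypothetical Service name that resolves
    through cluster DNS from other pods in the same project.
    """
    host = env.get("DB2_HOST", "db2")
    port = env.get("DB2_PORT", "50000")  # DB2's usual default port
    database = env["DB2_DATABASE"]       # required; KeyError if missing
    user = env["DB2_USER"]
    password = env["DB2_PASSWORD"]       # would normally come from a Secret
    return (f"DATABASE={database};HOSTNAME={host};PORT={port};"
            f"PROTOCOL=TCPIP;UID={user};PWD={password};")
```

The API would pass this string to its DB2 client library at startup. Failing fast on a missing variable makes a misconfigured deployment easier to spot than a connection timeout.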

https://redd.it/1diyeov
@r_devops
Cloud for DevOps

Just getting started with my DevOps journey. Most DevOps courses on YouTube require AWS, but I am more familiar with GCP than AWS (beginner level in GCP, cleared GCP ACE).

Is it possible to continue my journey with GCP, or is AWS a necessary requirement for learning DevOps and for future use?

https://redd.it/1div6em
@r_devops
Any learning tool where I can practice the concepts I learned?

Hi all,
Just curious to know if there is any online tool or website where I can learn and practice at the same time. Switching from video tutorials to my Linux machine is a little distracting. I'm looking for places where this could be more engaging and efficient.

https://redd.it/1dj9u9v
@r_devops
PlatformCon 2024 Workshop: Deep Dive: Delivering and Managing an LLM Agent Application with KusionStack

YouTube: https://www.youtube.com/watch?v=ekYrvL27gv4

This workshop serves as an in-depth exploration of KusionStack and how it can be leveraged to deploy an LLM Agent application to the cloud. The demo consists of several stages that progressively carry out a story to deploy and manage an LLM Agent, each slightly more complex than the last.

This session is designed to provide a practical understanding of managing complex applications in modern cloud-native environments. Through real-world examples, we'll demonstrate KusionStack’s pivotal role in enabling the realization from intents to actual deliveries, providing actionable insights for leveraging KusionStack in a similar environment.

Prerequisites:
- A Kubernetes cluster. Minikube or Kind is fine too
- AWS Account with AccessKey and SecretKey ready (Optional, for provisioning cloud resources)
- An OpenAI key (Optional, for testing the Agent Application)

Website: https://www.kusionstack.io/docs/
GitHub: https://github.com/KusionStack

https://redd.it/1djbbyw
@r_devops
Is your company's tech documentation also a mess?

Hey everyone, I work at a medium-sized tech company and our documentation is all over the place. We mainly use Confluence, but also GitLab READMEs and Slack channels. The information is spread out and it's quite hard to find what you need. From reading this subreddit, it seems a lot of tech companies are in the same boat.

Has anyone used or built anything internally to help with this?

I'm thinking of something like a semantic search across Confluence, GitLab, and Slack channels. Does anybody know of something that can do this?

https://redd.it/1djd7yz
@r_devops
How do you roll out components to your clusters?

I have a vague idea and would like some help working out the details!

Say you need to upgrade Kong using TF and Flux.
You have one AKS cluster in each region.

Do you go into each cluster repo and upgrade one at a time?

or

Do you go to one repo and upgrade all of them simultaneously?

The problem is that there are many clusters. Is there an architecture where you could just upgrade all of them at once? What is the requirement there: to have applications running on infra that can handle it?
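
A middle ground between "one at a time" and "all at once" is a staged rollout from a single repo: one canary cluster first, then the rest in waves. The sketch below only shows the wave planning. The cluster names are made up and the wave sizes are arbitrary assumptions, and how each wave is actually promoted (for example, a per-wave version pin that Flux reconciles) is left out.

```python
def plan_waves(clusters, canary_count=1, wave_size=2):
    """Split clusters into a canary wave followed by fixed-size waves."""
    if canary_count < 0 or wave_size < 1:
        raise ValueError("canary_count must be >= 0 and wave_size >= 1")
    waves = []
    canary = clusters[:canary_count]
    if canary:
        waves.append(canary)
    rest = clusters[canary_count:]
    # Chunk the remaining clusters; the last wave may be smaller.
    for i in range(0, len(rest), wave_size):
        waves.append(rest[i:i + wave_size])
    return waves


# Hypothetical cluster names, one per region.
CLUSTERS = ["aks-westeurope", "aks-eastus", "aks-uksouth", "aks-japaneast"]
```

With these defaults, `plan_waves(CLUSTERS)` upgrades `aks-westeurope` alone first, then two clusters together, then the last one, so a bad Kong release is caught before it reaches every region.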

https://redd.it/1djflbb
@r_devops
Unmasking HTTP Logs: From Blind Spots to Full Visibility with Gleam and Quickwit

https://blog.kalvad.com/unmasking-http-logs-from-blind-spots-to-full-visibility-with-gleam-and-quickwit/?ref=kalvad-newsletter

The author came up with an original solution for monitoring HTTP logs (NGINX, Apache, Traefik, ...) with all critical details, like request and response bodies.
They use Gleam (a friendly language for building type-safe systems) to build a proxy which captures all necessary details and sends structured logs to a storage backend, Quickwit.

What is your opinion on this approach? How do you produce detailed structured HTTP logs on your infra?
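
For readers unfamiliar with the idea, a structured HTTP log record that includes bodies could look roughly like the sketch below. This is not the article's Gleam implementation. It is a Python illustration with made-up field names and an arbitrary body size cap, emitting one JSON object per request.

```python
import json
from datetime import datetime, timezone

MAX_BODY_BYTES = 2048  # hypothetical cap so large payloads don't bloat the logs


def http_log_record(method, path, status, request_body: bytes,
                    response_body: bytes, duration_ms: float) -> str:
    """Serialize one request/response pair as a single JSON log line."""

    def clip(body: bytes) -> str:
        # Keep at most MAX_BODY_BYTES and tolerate non-UTF-8 payloads.
        return body[:MAX_BODY_BYTES].decode("utf-8", errors="replace")

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
        "request_body": clip(request_body),
        "request_body_truncated": len(request_body) > MAX_BODY_BYTES,
        "response_body": clip(response_body),
        "response_body_truncated": len(response_body) > MAX_BODY_BYTES,
    }
    return json.dumps(record)
```

In practice bodies often carry secrets or personal data, so a real proxy would also need redaction before anything is shipped to a log backend.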

https://redd.it/1djgaa3
@r_devops
Be creative when it comes to naming environments

https://shippingbytes.com/2024/06/19/be-creative-when-it-comes-to-naming-environments/


Do you call your environments by canonical names like prod or staging? I don't do that anymore because there is never a single prod, and dev does not stay dev for long. I have a different strategy.

https://redd.it/1djiu4e
@r_devops
10 Books to Accelerate your Cloud Career

Not a comprehensive nor definitive list, but a list of books that have helped me personally a great deal. Interested to see if you agree, and to know which titles you would include in your top 10.

https://medium.com/@jake.page91/10-essential-books-to-accelerate-your-cloud-career-f43f33d5e859

https://redd.it/1djjknq
@r_devops
What should be in your "interview" repo?

I haven't interviewed for a job in well over 10 years at this point, so I am honestly just ramping up on what I'm going to need to prepare for. I fell into this position after being an onsite Data Center Sys Engineer and then a Virtualization Engineer. Technically I'm in the SRE group, but I do DevOps deploys and general infrastructure for our different cloud envs. I can script in Bash, PowerShell, Python, etc., but nothing that is ever going to impress anyone. I'm less of a "move fast and break things" kind of person and more of a "favor stability and make sure my team can sleep at night without outages" kind of person.

I already know that if they give me leetcode/hackerrank type questions I'm not going to pass. They gave us hackerrank questions at work to "review" for new hires and half of them were completely irrelevant to our day-to-day work. I have never needed to build a hashmap from scratch, since I primarily work in Terraform and Ansible, which abstract that work away, and Python has libraries for that. But the company also has a policy of "only hiring developers" for SRE and DevOps roles.

I've seen people talking about having a repo as a sort of "portfolio." I'm not trying to seed my startup or be a thought leader. I just want to do a good job at work and go home. I love learning new things so I'm not against remediating any skills gaps.

What are some standard projects I can mock up and put in a repo?

I prefer to work in Terraform and Ansible but I was thinking:
- Bash scripts for Linux admin
- PowerShell scripts for VMware admin
- Python SDK for AWS deploys

https://redd.it/1djkm2f
@r_devops
code reviews become test plans

I have noticed more and more often that a devops code review becomes just a discussion as to how best to test the change. The most recent one was making an adjustment to a k8s network policy. Everyone looked at it and said, "looks okay to me but maybe you should test it."

This doesn't seem wrong to me. Just curious if others have seen this trend also.

https://redd.it/1djn8c1
@r_devops
Azure DevOps pipeline - anyone know how to make a changelog from pipeline content?

Hello, I'm working on a project where I would like every pipeline run to create a changelog update based on the contents of git.
I have seen others use pipelines to update the Azure DevOps wiki. I'm wondering if anyone has experience with something that works?
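
One way this could work is a small script step that turns `git log` output into a Markdown section, which a later step then commits to the wiki repo. The sketch below covers only the first part. The revision range, the title and the entry layout are assumptions, and the agent needs enough git history checked out for the range to resolve.

```python
import subprocess


def read_commits(rev_range: str):
    """Return (short_hash, subject) pairs for a range such as 'v1.0.0..HEAD'."""
    out = subprocess.run(
        # %h = short hash, %x09 = tab, %s = subject line
        ["git", "log", "--no-merges", "--pretty=format:%h%x09%s", rev_range],
        check=True, capture_output=True, text=True,
    ).stdout
    return [tuple(line.split("\t", 1)) for line in out.splitlines() if line]


def render_changelog(title: str, commits) -> str:
    """Render commits as a Markdown section, in the order they were given."""
    lines = [f"## {title}", ""]
    lines += [f"- {subject} ({short_hash})" for short_hash, subject in commits]
    return "\n".join(lines) + "\n"
```

A pipeline step could call `render_changelog("Build 42", read_commits("v1.0.0..HEAD"))` and prepend the result to a changelog page. Both the build label and the tag name here are placeholders.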

https://redd.it/1djl8cy
@r_devops
Azure DevOps self-hosted agents and Artifacts Feed

Here is what I am trying to accomplish:

1. I have a Python package that I have published to an Azure Artifacts feed.


2. I want to install this Python package on a self-hosted Azure DevOps agent as part of a pipeline.


3. The pipeline will run some code that depends on the Python package.


To do this, I need to authenticate with the Azure Artifacts feed from the pipeline running on the self-hosted agent. Any ideas on how to go about this? I do not want to use a Personal Access Token (PAT) because it is linked to a user and not to a group or service account.
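
One option that avoids a user-bound PAT is the pipeline's own job token (`System.AccessToken`), which belongs to the build service identity rather than a person. The sketch below assumes that token has been mapped into the step's environment as `SYSTEM_ACCESSTOKEN`, that the build service identity has read access to the feed, and that the feed is project-scoped. The organization, project, feed and package names are placeholders.

```python
import os
import sys
from urllib.parse import quote


def feed_index_url(organization: str, project: str, feed: str, token: str) -> str:
    """Build a pip index URL for a project-scoped Azure Artifacts feed."""
    # The user name is a placeholder; the token carries the identity.
    return (f"https://build:{quote(token, safe='')}@pkgs.dev.azure.com/"
            f"{organization}/{project}/_packaging/{feed}/pypi/simple/")


def pip_install_command(package: str, index_url: str):
    """Return the argv for installing a package from the given index."""
    return [sys.executable, "-m", "pip", "install",
            "--index-url", index_url, package]


def index_url_from_env(organization: str, project: str, feed: str) -> str:
    """Read the job token from the environment and build the index URL."""
    return feed_index_url(organization, project, feed,
                          os.environ["SYSTEM_ACCESSTOKEN"])
```

Note that `--index-url` replaces PyPI entirely, so public dependencies would need an upstream source on the feed or `--extra-index-url` instead. The built-in `PipAuthenticate` pipeline task is another route worth checking before hand-rolling the URL.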


https://redd.it/1djpgjv
@r_devops
How to transition into a cloud-based DevOps role?

I am almost 2 years into working for an org with 0 exposure to cloud (the product uses cloud for data storage, which is managed by the backend teams). We work on ensuring availability and making configuration management changes to servers and Jetsons on the edge. I have worked on Bash and Python scripting, a lot of troubleshooting of Ubuntu running CUDA, and Docker, Ansible, Grafana and Prometheus setups, among others.

I'm trying to move out of this org for better pay, but nobody's really doing similar work, and cloud-native orgs are not giving me a chance at even an interview/call.

Should I rack up more experience at the same org and be patient, or am I going wrong somewhere?

https://redd.it/1djqs04
@r_devops
Recommend a monitoring alternative

We're spending 500k a year on Datadog metrics + logging.

We'd like to run our own regional Kubernetes cluster where we can provide monitoring services for multiple products.

Is there any stack someone could recommend that replaces the functionality of Datadog but has a permissive enough license to self-host (for internal use only, not providing access to tenants)? It should also be able to use Kubernetes to scale.

https://redd.it/1djsd94
@r_devops
Does anyone work with technical specifications documents while freelancing in DevOps?

Hello,

I was wondering if DevOps freelancers use technical specifications documents.

In French, we call it a « Cahier des charges ». It’s a document you keep as a template to specify each requirement and specification for the task you are going to do.

The template has some questions about the environment setup in the client’s architecture and about exactly what results and preferences they want. You then evaluate it and decide what the requirements are and how much it will cost in total (for your work + requirements fees).

Do you do that? If yes, do you have any templates you use or have used before?

Thank you in advance 🌹

https://redd.it/1djjogv
@r_devops
I Built Iudex, Low Cost Observability with OTel and AI

Hey folks, I'm the founder of Iudex, and we're building a super affordable observability platform that uses AI to tackle the headaches of high log retention costs, confusing configurations, and log correlation. We've designed Iudex to be low cost because we know how important it is not to have to think about how much data you're sending. We run completely on OTel, so if you're already using it, adding our exporter is trivial. And we're focused on giving teams a low time to value. Our current features include: built-in log attribute filtering, keyword search, natural language search, traces, and service level metrics.

We're doing a soft launch and are looking for users who want good observability tooling but don't want to overpay for it. Join our waitlist, and as a thank you for being a beta user, you'll get 100 million logs per month for free!

https://redd.it/1dkq68c
@r_devops
Should I .gitignore everything by default?

I recently started working with a team where the default behaviour is to git ignore everything and then add a negation ("not ignore") rule to the .gitignore file for any content that should be committed.

I have never worked like this and I am really struggling, as no one can tell me why this is the pattern.

Is this common practice? Help? Please 🥲

https://redd.it/1dk6of3
@r_devops
What are your strategies when feeling overwhelmed?

I've been an SRE for a few years. I thought I would shake it off after a few years of experience, but I still feel overwhelmed with things.

Not so much by the amount of work, which seems endless and has me always playing catch-up, but much more by having to learn more and more technologies. It feels like I'm always behind.

https://redd.it/1dkuj2w
@r_devops