Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
In need of extracurricular advice

So, obviously there are the coding projects, hackathons, robotics, and anything else tech-y, but apparently everybody has those. I want your opinion on a few clubs (less common for people in tech) that tech companies might see value in, in addition to all the usual tech-related side projects/clubs:

1. Finance clubs in general, but more specifically investments and stocks (for the math, data analysis side)
2. Model United Nations experience (for the critical thinking, efficient communication, analysis, negotiation)

I'm interested in big tech companies for software roles.

https://redd.it/y6rstf
@r_devops
Help with CI/CD using multi-repo in Github Actions

Hello everybody,

We have a project composed of frontend (Service A), backend (Service B), and one microservice for processing data (Service C).

I have set up GitHub workflows in each repository that automatically test, build and push Docker images to a registry.

The part I am missing is:

How do I test/deploy the product using all of the microservices? I want to set up a staging environment where I run e2e tests on all of the deployed services before I deploy them to production.

For example, I update Service C to version 1.5.0 and I want to test the compatibility with the current version of the other services.

My idea so far is to have a "main repo" for deployment and e2e testing scripts, plus a manifest file containing the current versions of the pre-release services. On a successful build, the workflow of Service C will update its version in the "main repo" and create a new pull request. This will trigger a workflow for deploying to the staging environment, and if everything passes, another workflow will be triggered for deploying to production.

Is there a better approach to this solution? I am having trouble understanding the staging environment as a part of the CI/CD process.
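The manifest-bump step of this approach can be sketched minimally (the versions.json format and service names here are hypothetical; the commit and PR plumbing would live in Service C's workflow, e.g. via a PR-creating step):

```python
import json

# Hypothetical manifest kept in the "main repo" (e.g. versions.json),
# pinning the pre-release version of each service.
MANIFEST = '{"service-a": "2.1.0", "service-b": "1.8.3", "service-c": "1.4.2"}'

def bump_service(manifest_json: str, service: str, version: str) -> str:
    """Pin one service to a new version; return the updated manifest text."""
    manifest = json.loads(manifest_json)
    manifest[service] = version
    return json.dumps(manifest, indent=2)

# After a successful build, Service C's workflow would run something like
# this, commit the result on a branch in the main repo, and open a PR.
# Merging the PR then triggers the staging deploy + e2e workflow.
updated = bump_service(MANIFEST, "service-c", "1.5.0")
```

A nice property of the PR step is that each e2e run is tied to an explicit, reviewable version set, so a failed staging run blocks exactly one version bump.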

https://redd.it/y6cycp
@r_devops
Company has a crazy plan to bring sales and developers together, is it realistic?

Well, to put it simply, my company is primarily a software development company that sometimes works with external contractors to fulfill feature requests and fix bugs when our in-house team is overloaded. We collaborate on a range of platforms, but our primary in-house project tracker is Azure DevOps. It's the time of year when we are in a mad rush to finish features before the holidays arrive, and today management presented a plan to help us (yes, I was reluctant).

The plan is to integrate AzDO with some CRM software I have never heard of (Dynamics 365?) and let non-technical people (say, a salesperson) gather feature/bug requests from customers. The workflow would look like: salesperson asks what could be better > dumps this data into a CRM ticket > this translates within AzDO into a feature/user story/bug > we complete the request and close it in AzDO > it then auto-closes in the CRM software.

My big pain point is that features need to be specified at a granular level, which salespeople can't seem to grasp. I am probably just overthinking this, and maybe features can be a bit broader in scope. I have a feeling, though, that I am going to get a lot more feature requests that I will have to write user stories for. My questions for you guys:

Is this even a real world solution for software dev companies?
Are any of you guys out there using a system like this?
Any issues you see or am I just overthinking logistics?

https://redd.it/y6z80v
@r_devops
Super light monitoring and alerting stack for personal projects

I run a small single-VPS web application. My CI/CD also runs on the same VPS via a cronjob. This setup may expand to two VPSes in the future, but at present high availability is not a requirement.

I dabbled a bit with self-hosting the Elastic stack on another VPS, but honestly Kibana seems super bloated, and the resource requirements of the stack are greater than those of the system it is supposed to monitor. I also played around with Grafana's cloud monitoring stack, but like Elastic it felt bloated and overpowered for what I'm doing.

What I need:

- Application monitoring like response codes, latency and other application metrics. This could be instrumented in my code or written to a log file that is shipped and parsed by something like Filebeat.

- Basic system monitoring - CPU, memory, restarts etc.

- Set alert conditions that trigger a notification on PagerDuty or an SMS.

What is the lightest, bare minimum monitoring stack that can do this?
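For the alerting requirement, even a cron-driven check can page directly: PagerDuty's Events API v2 accepts a small JSON payload over HTTPS, so no alerting server is strictly needed. A sketch (the routing key, host name and trigger condition are placeholders):

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_event(routing_key: str, summary: str, severity: str = "critical") -> dict:
    """Build a PagerDuty Events API v2 'trigger' payload."""
    return {
        "routing_key": routing_key,      # integration key, placeholder here
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "my-vps",          # placeholder host identifier
            "severity": severity,        # critical | error | warning | info
        },
    }

def send_event(event: dict) -> None:
    """POST the event to PagerDuty (not called in this sketch)."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example condition, run from cron: page when the app health check fails.
# if health_check_failed:
#     send_event(build_event("YOUR_ROUTING_KEY", "app health check failing"))
```

For a single VPS, a handful of scripted checks plus one paging hook like this can cover the alerting leg while a lighter metrics tool handles the dashboards.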

https://redd.it/y6zy58
@r_devops
Fastest, financially efficient way to deploy GPUs?

I've been working for weeks now to find the best way to deploy cloud GPUs. There are lots of great serverless options these days, but unfortunately none meet the needs of my use case: deploying individual ML models and controlling each instance. I've been tinkering with deploying GPU virtual machines on Google Cloud, but it takes far too long for my use case. I'm looking into Kubernetes, but it's not quite clear to me what the best way to set it up is. Does anyone have good ideas on how to deploy a GPU for ML inference in the fastest, most economical way?

https://redd.it/y70kh1
@r_devops
What VSCode Customizations and Plugins do you use for optimal workflow?

I'm just interested in what you think you can't work without. What your essentials are.

Whether it's extensions, themes, other customizations or particular systems for an optimized workflow.

https://redd.it/y6z5mf
@r_devops
A discussion about optimal DB design for saving aggregations

I have raw data.

There are 5 filters defined on this raw data: user_id,f2...f5.

There are ~50k permutations of the f2...f5 filters.

There are 365 time windows. Each time window is 1 day.

For each time window, for each filter permutation, 50 different aggregations are saved.

I think of saving this data in a single aggs table, with a schema as follows:

user_id | f2 | ... | f5 | agg1 | ... | agg50 | start_time | end_time

That means every user, on average, has 50k * 365 ≈ 18M rows.

When I query the data, 100% of the time I will filter on user_id and f2..f4.

Most of the queries are performing some aggregation on top of some agg column.

i.e., select count(agg1) from aggs where user_id=... and f2=... and ... and f5=... and agg1=... and start_time=...

Filter f5 and timestamps (start/end_date) are optional for filtering.

That means the query is running on a pretty "small" number of rows.

Some queries will update some rows and backfill missing ones.

The number of distinct user_ids is effectively unbounded, of course.

1. I assume I need to index the table by user_id, f2..f4 at least. Should I also partition it in some way, maybe by user_id?
2. Should I use time-series DB? Are there any benefits in this compared to "raw" Postgres?
3. As the number of user_ids grows toward infinity, this table will bloat. I assume this will degrade the performance of the DB. What solutions can I apply to keep querying fast? Retention? Compression?

Maybe the whole DB design should be different. Please express your general opinion on it.
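For points 1 and 2, a composite index over the always-filtered columns is the usual starting point, and Postgres declarative partitioning (e.g. PARTITION BY HASH (user_id)) can keep per-user scans bounded. A sketch of the schema, using Python's sqlite3 purely as a stand-in (two agg columns instead of 50; the partitioning itself is Postgres-only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Two agg columns stand in for the 50; in Postgres this table could be
# declared PARTITION BY HASH (user_id) to keep per-user data clustered.
cur.execute("""
    CREATE TABLE aggs (
        user_id    INTEGER NOT NULL,
        f2 INTEGER, f3 INTEGER, f4 INTEGER, f5 INTEGER,
        agg1 REAL, agg2 REAL,
        start_time TEXT, end_time TEXT
    )
""")

# Composite index on the columns filtered in 100% of queries; f5 and the
# timestamps are optional filters, so they come last in the key.
cur.execute(
    "CREATE INDEX aggs_lookup ON aggs (user_id, f2, f3, f4, f5, start_time)"
)

cur.executemany(
    "INSERT INTO aggs VALUES (?,?,?,?,?,?,?,?,?)",
    [(1, 10, 20, 30, 40, 0.5, 1.5, "2022-01-01", "2022-01-02"),
     (1, 10, 20, 30, 41, 0.7, 1.7, "2022-01-01", "2022-01-02"),
     (2, 10, 20, 30, 40, 0.9, 1.9, "2022-01-01", "2022-01-02")],
)

# Typical query shape: aggregate over an agg column for one filter slice.
rows = cur.execute(
    "SELECT count(agg1) FROM aggs WHERE user_id=? AND f2=? AND f3=? AND f4=?",
    (1, 10, 20, 30),
).fetchone()
```

In Postgres the equivalent would be the partitioned table plus a btree index on (user_id, f2, f3, f4, f5, start_time); a BRIN index on start_time is another option once the time-ordered data grows large.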

https://redd.it/y7apj3
@r_devops
Guys, I'm building a Docker image to build my project APK. I have a small question.

I made a Docker image to build my project's APK. The image is 3.4 GB in size. I want to know if there's any way to download the Android SDK at APK compile time inside the container, instead of baking it into the image, as this would considerably reduce the image size. As of now, I used the command-line SDK manager to install the SDK on the container.

https://redd.it/y7d3ku
@r_devops
Currently stuck at work! Can anyone help me understand what the heck SDI is? (Server-Defined Infrastructure)

Hi everyone.

I was recently tasked by my current employer with learning about SDN and SDI. Specifically, the product below. Now, if only I could understand what the heck I'm looking at. I have a related "background", but I'm super lost. Can anyone help me break these down? Here's an example...

Lenovo Server-Defined Infrastructure (https://www.lenovo.com/us/en/servers-storage/servers/):

- ThinkAgile HX Series: Designed for easy deployment and manageability, Lenovo ThinkAgile HX combines Nutanix software with Lenovo’s #1 reliable, high-performing platforms.
- ThinkAgile MX Series: ThinkAgile MX Certified Nodes accelerate your adoption of Microsoft Azure Stack HCI solutions (Storage Spaces Direct) by streamlining the ordering process for validated configurations with an easy-to-use machine type.

Any direction or input is appreciated. Thank you all so much!

https://redd.it/y7dtze
@r_devops
How are you building docker images for Apple M1?

Hello,

my org has recently begun adopting M1 MacBooks, and our developers use some external Docker images as part of their dev workflow. Unfortunately these images are built for amd64, and while they work, they aren't really fast. I've been working on rebuilding these images in my CI infrastructure, and I want to build them for ARM as well.

How have you solved this?

https://redd.it/y75ouj
@r_devops
leave a full time job for an internship

I am a third-year computer science student.
In recent years I have always worked full time, as a system administrator at a not very well-known Italian consulting firm, to pay for my studies.

I received the opportunity to do a 2-month internship at top investment banks in London (all tier-1 companies, not FAANG unfortunately), paying 3x my salary lol.

I would like to ask for unpaid leave to have this experience, and if they don't allow it I will resign.

I want to continue applying to other positions, maybe to get a 6-month internship.

My goal is to work at a FAANG one day, after I graduate in 2024 in Italy.
Having these names on my CV, even if only for 2 or 6 months, will help me much more, in my opinion.
But I'm afraid of leaving my full-time job and not having a job for a year and a half after the internship.

Is it worth it?

What do you think about it?

https://redd.it/y7f0q1
@r_devops
Ready-to-use Day 2 production recipe?

I want to set up an opinionated infrastructure that's pretty much ready to use without having to do a lot of extra work.
We're currently using EKS k8s in production and development but I'd like to stop worrying about nodes, or get specialists to monitor and manage the nodes, as well as all the other bits needed for a full production environment.

Things like firewalls, load balancers, mesh, autoscaling, observability, DNS, etc.

The recipes could be Pulumi (any flavor), Terraform or anything else except CloudFormation, since that never works.

It could even be a paid service.

I like the idea of platform9 and Rafay, but they're limited to kubernetes.

Is there a service out there, or production-ready templates, that can be applied to set up new infrastructure that's ready for production?

https://redd.it/y7ieb6
@r_devops
What are some DevOps best practices - Development environment, staging, and production for infrastructure as code

Hi,

What are the best practices when deploying infrastructure as code for a larger organization with various products and development teams?

In particular, deploying to the different stages: development, staging and production.

How do development teams interact with the DevOps engineering teams?

When a development team develops software, does the DevOps team provision the required cloud resources for that team?

Thank you

https://redd.it/y7le8h
@r_devops
Software engineers who transitioned into DevOps, or vice versa... do you like the choice you made? Any regrets?

title.

How has your experience been going from SWE ---> DevOps or from DevOps ---> SWE ?

https://redd.it/y7khf2
@r_devops
DevOps without support

We are in the middle of a transition to a DevOps environment, on-premise and in the cloud.
Some teams will need continuous support because they are basically not interested in anything infrastructure-related.
Do we still need an overall support team for all the teams that do not want to manage or set up their own infrastructure?

https://redd.it/y7eree
@r_devops
Read-only access for a user

Hi all, has anyone here used the postgresql_default_privileges resource? I have a user (iamreadonly) with only SELECT privileges defined, but the issue is that the iamreadonly user is able to create tables, which shouldn't happen.

Also, the iamreadonly user should be able to access tables created by another user (application-user), which I have specified in the block of code below. The application-user has ALL permissions; is this being inherited by the iamreadonly user as well? Can anyone tell me if I'm missing anything here?

resource "postgresql_default_privileges" "readonly_all_tables" {
  role        = "readonlyrole"
  database    = "db-name"
  schema      = "public"
  owner       = "application-user"
  object_type = "table"
  privileges  = ["SELECT"]
}

https://redd.it/y7qpul
@r_devops
AWS recommended that all Node.js v12 Lambdas be upgraded to Node.js v16 before mid-November. How can I approach that in a safe workflow, and should I use Fusebit?

Should I use CloudFormation or the CLI to update the Lambdas?
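One cautious CLI-style workflow is: inventory everything still on nodejs12.x, upgrade a low-traffic canary function, then roll through the rest. A sketch assuming a boto3-style Lambda client (if the functions are owned by a CloudFormation stack, change the Runtime property in the template instead, so the stack doesn't drift):

```python
def find_node12_functions(client) -> list:
    """Return names of Lambda functions still on the nodejs12.x runtime."""
    names, marker = [], None
    while True:
        kwargs = {"Marker": marker} if marker else {}
        page = client.list_functions(**kwargs)
        names += [f["FunctionName"] for f in page["Functions"]
                  if f.get("Runtime") == "nodejs12.x"]
        marker = page.get("NextMarker")
        if not marker:
            return names

def upgrade_runtime(client, name: str) -> None:
    """Switch one function to nodejs16.x; invoke and verify it afterwards."""
    # nodejs12 -> nodejs16 can break code relying on removed Node APIs or
    # old bundled dependencies, so test each function after upgrading.
    client.update_function_configuration(FunctionName=name, Runtime="nodejs16.x")

# Real usage would be:
#   import boto3
#   client = boto3.client("lambda")
#   for name in find_node12_functions(client):
#       upgrade_runtime(client, name)
```

Upgrading a single canary first and watching its logs and error metrics before touching the rest keeps the blast radius small.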

https://redd.it/y7nvv2
@r_devops
Gradual Update of the AWS Java SDK in a Spring Boot Project

Recently, in our project, we decided to update the AWS Java SDK from 1.x to 2.x, so that we could use client-side metrics, which are available only in the newer version of the SDK.

Our whole system is AWS based, so we didn't want to perform this update all at once. We decided to do it gradually instead.

Fortunately, the AWS SDK allows both versions to be used side by side.

## Preparation for AWS Java SDK update

In our project, we implemented an abstraction layer over AWS services, like:

- QueueSender over SqsClient, with an AwsQueueSender implementation
- QueuePublisher over SnsClient, with an AwsQueuePublisher implementation
- ExternalStorage over S3Client, with an AwsFileUploader implementation

The approach of introducing an abstraction layer over external services and frameworks comes in really handy, especially in cases like ours — changing the implementation of these abstractions.

The first thing we did was to add a new implementation for these services using SDK v2, so we added AwsQueueSenderV2, AwsQueuePublisherV2 and AwsFileUploaderV2.

## Challenges

Some libraries that we used to implement our services don't support SDK v2 (or don't support both versions side by side), so we needed to fork these libraries and adjust them to our needs. These are public repositories, so if you are planning to migrate your project, you could use:

- amazon-sns-java-extended-client-lib
- [amazon-sqs-java-extended-client-lib](https://github.com/bright/amazon-sqs-java-extended-client-lib/tree/sdk-v2-support)
- amazon-sqs-java-messaging-lib

Then we copied all the tests covering the original implementations, so we could run them against the new ones.

Running the tests against a new implementation allowed us to find a bug in it: we had messed up the order of parameters 🙈.

## Migration from 1.x to 2.x of the AWS Java SDK

We decided to take advantage of Spring capabilities to gradually replace old AWS services implementations with the new ones, and for that we used the @Priority annotation.

We annotated the 1.x bean implementations with @Priority(1) and the 2.x implementations with @Priority(2). Then we deployed our application to a test environment and monitored for unexpected changes. After verifying it, we deployed the application to the production environment and continued monitoring to confirm that everything was still fine.

In the next step, we chose a couple of non-business-critical functionalities and replaced the old services with the new ones, using the @Named annotation. After repeating the deployment and monitoring steps, we were sure our new implementations were working as expected, so we could release the application with all AWS beans updated. We did this by changing the priority of the 1.x beans from @Priority(1) to @Priority(3).

## Cleanup

Everything went well, so we could remove temporary annotations, 1.x implementations, and V2 suffixes from 2.x implementations.

## Summary

Although it took a couple of extra steps, we were able to introduce a major update to our production application without downtime or risk of introducing breaking changes. This way is much safer and helps us avoid mistakes that could affect our customers.

https://redd.it/y7ylw0
@r_devops
Saga continues—DevOps vs politics

I work as part of a team responsible for automating the install and configuration of a pretty complex set of applications across groups of servers. We basically ensure all requirements are met for security, compliance, hardening, etc. Upstream devs build the application containers, while a sysadmin ops group builds the OS templates our team uses. Our team also manages the relationship with the clients who rely on the apps. We took great pains to explain to clients why a DevOps approach is needed to reduce risk and increase reliability and uptime. Clients have been happy with the DevOps approach and have been giving us time to test changes properly before releasing to prod.

Recently, the client group made some internal changes and is now going directly to the sysadmin group for rights to make config changes outside of the DevOps pipelines. The client representative claims our DevOps process prevents ad-hoc changes "just to test something". Our team is then asked to load small changes in after the fact. We have explained to the clients that that's what a test environment is for. We didn't go so far as to say that what they are doing is backwards. Risk and workload are increased, but how do you convince a now-hostile client, enabled by a sysadmin group that doesn't value DevOps?

https://redd.it/y82620
@r_devops
Which Let's Encrypt client to use?

Dear fellow DevOps,

I am currently trying to run a service (Vault, to be more precise) that's in a private subnet but should have an SSL certificate.

We're currently running on GCP and use acme.sh. We have two projects: one for the service itself, where it can store secrets, and another ACME project used for DNS alias mode.

The machines are managed in a Managed Instance Group and sit behind an internal L4 load balancer.

The process now looks like this:

1. Cloud-init creates the gcloud profiles (by switching CLOUDSDK_ACTIVE_CONFIG_NAME)

2. The systemd service files and packages are configured via Salt

3. In the startup script of the VM, I activate the default profile and fetch the existing Let's Encrypt account key (to avoid hitting Let's Encrypt's rate limits), then put the key in place

4. Install acme.sh

5. Activate the get-certificate profile and kick off the certificate request

6. Change back to the default profile and upload the LE key if it was empty in the beginning

In the end we use Caddy to reverse proxy to the service.

Unfortunately, the problem is that the cron job from acme.sh does not use the get-certificate profile, and I'd have to customize it to renew certificates.

What client do you recommend in combination with GCP and DNS alias mode?

Thanks and BR

https://redd.it/y807h3
@r_devops
DevOps & Pipeline Runners: The Key to Sustainable App Development

DEVOPS WEBINAR!

📌Have you ever wondered how DevOps experts choose the best Pipeline Runners for the job? Join us as three DevOps experts discuss the tools they're using now and weigh the pros and cons of the most popular DevOps tools available.

📷 Save your seat! https://my.demio.com/ref/P1vnTtR1cOvEHVTz

https://redd.it/y88ejv
@r_devops