How we brought automated rollbacks to 2,100+ services using Argo Rollouts
Hey everyone 👋 I work in the Backend Platform team at Monzo.
We've written about how our team brought automated rollbacks to our deployment system. This is the most substantial change we’ve made to our deployment system in some time, so it was not without its challenges!
At the heart of this new feature is Argo Rollouts - a Kubernetes extension that supports advanced deployment strategies. In this post we dig into how we integrated Argo Rollouts with our existing deployment tooling, while keeping the Monzo delight factor. We show how we migrated all 2,000+ services to this new system and discuss the lessons we learnt along the way.
🔗 Here's the link: https://monzo.com/blog/2022/11/02/argo-rollouts-at-scale
We’d love to hear your thoughts and questions.
https://redd.it/yox68a
@r_devops
Hi r/devops, how would you write end-to-end system tests for a system composed of multiple Java apps connected by Kafka and with multiple databases? I have managed to run the whole system in Docker for development. Now I need a framework to write test cases like the one below and run them in Docker.
app_1 --> kafka_topic_1 --> app_2 --> kafka_topic_2 --> app_3 --> postgres_db
Example test case: app_1 publishes a message + assert that a new DB entry is created
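A minimal sketch of how that example test case could be shaped, independent of any particular framework: publish, then poll the database until the row shows up or a timeout expires, since the pipeline is asynchronous. The names `eventually`, `fake_publish`, and `fake_db` are hypothetical; in a real run the publish step would wrap a Kafka producer and the check would be a Postgres query, and here both are replaced by in-memory stand-ins so the sketch runs on its own.

```python
import threading
import time
from typing import Callable


def eventually(check: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    # One last attempt so a result arriving right at the deadline is not missed.
    return check()


# Hypothetical in-memory stand-in for postgres_db.
fake_db: list[str] = []


def fake_publish(message: str) -> None:
    """Stand-in for app_1 publishing to kafka_topic_1.

    The 'pipeline' (app_2, app_3) is simulated by a timer that writes
    the message into fake_db after a short delay.
    """
    threading.Timer(0.2, fake_db.append, args=[message]).start()


def run_case() -> bool:
    """Example test case: publish a message, then assert a new DB entry appears."""
    fake_publish("order-42")
    return eventually(lambda: "order-42" in fake_db, timeout=2.0)


if __name__ == "__main__":
    assert run_case(), "expected the message to reach the database"
```

The polling helper is the part that carries over to a real setup: a fixed `sleep` before the assertion tends to be either flaky or slow, while a bounded poll fails only when the entry never arrives.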
https://redd.it/yown5v
@r_devops
Serverless containers: forced to do microservices? Per entity or per operation?
Tech stack, if it matters: Fastify, GraphQL, Docker image.
I have a monolith application that was initially on Google Cloud Run, and the cold start was pretty bad. Now that I think about it, that was probably because my container was a monolith.
I now plan on migrating to AWS, where Lambda can run Docker containers. I was watching AWS talks which say that you should keep everything small to reduce cold start. Please note: I do not want to use AWS AppSync; I want my GraphQL schema to live with my application and not with AWS, to stay cloud agnostic. But then again, I think I have to make my Docker containers specific to the Lambda image.
Should AWS Lambda containers be treated the same as Google Cloud Run? They are essentially the same, right?
Back to the main question: with either AWS Lambda containers or Google Cloud Run containers, both being serverless containers, I am pretty much forced to do microservices just to get a small cold start, correct?
Do I break these down per entity or per method? One container per CRUD operation on user (4x Lambdas), or 1x container for the user entity and all its methods?
https://redd.it/yos4wp
@r_devops
What's an outdated hiring practice that companies should get rid of?
Title.
https://redd.it/yp3gap
@r_devops
CyberSec Question - How do I implement secure installation of a Debian package?
Hi,
I am currently working on a project, and I have hit a wall and am not sure how to proceed. I have software that creates a Debian package by running through multiple BB repositories. That package is later transferred to an offline system (no internet access). I then run dpkg to install the package.
Now the thing is, I want to make sure that there is some sort of verification for this procedure. I want dpkg to only go through for THIS specific package, and for future packages I create using the software, not for just any package it is given. I also want a specific user to be able to perform this installation, so I want to put a NOPASSWD line for the dpkg command in the sudoers.d/user file to allow that user to install the package, but only if verification goes through. I could just add dpkg [filename] to the sudoers file, but the file name alone is not good enough.
I am not really good at cybersec, so please give me some ideas on how to proceed. Thank you!!
https://redd.it/yoo1ad
@r_devops
Are there any "Configuration Manager" solutions out there?
Hi
I am not sure if "Configuration Manager" is the correct term.
I deploy my infrastructure as code, using a JSON file as the parameter file for each environment. I was wondering if there is any "Configuration Manager" solution on the market. I am thinking of a solution that would provide a user interface with the ability to create a "form" and add fields, each with its type (drop-down, int, string). Then a user could create a "new environment", fill in and select values in the form, and click "Save". The records would be saved in a backend database, and the pipeline would be designed to retrieve the records from the database.
The closest I can think of is Azure DevOps Variable Groups, but they do not support value types or validation; for example, they cannot have a drop-down menu.
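As a rough illustration of the typed "form" idea, the sketch below describes each field with a type and optional drop-down choices, then validates a JSON parameter file against that description. The `Field` class, the `validate` function, and the field names (`region`, `instance_count`) are all hypothetical and not taken from any existing product.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Field:
    """One form field: a name, a value type, and optional drop-down choices."""
    name: str
    kind: type
    choices: Optional[tuple] = None


def validate(schema: list[Field], values: dict[str, Any]) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for field in schema:
        if field.name not in values:
            errors.append(f"{field.name}: missing")
            continue
        value = values[field.name]
        # Exact type match, so that True/False is not accepted as an int.
        if type(value) is not field.kind:
            errors.append(
                f"{field.name}: expected {field.kind.__name__}, got {type(value).__name__}"
            )
        elif field.choices is not None and value not in field.choices:
            errors.append(f"{field.name}: {value!r} is not one of {field.choices}")
    return errors


# Hypothetical schema for a "new environment" form.
ENVIRONMENT_FORM = [
    Field("region", str, choices=("westeurope", "eastus")),
    Field("instance_count", int),
]

if __name__ == "__main__":
    params = json.loads('{"region": "westeurope", "instance_count": 3}')
    print(validate(ENVIRONMENT_FORM, params))
```

A pipeline step could run a check like this against the per-environment JSON file before deploying, which gives the type and drop-down validation without a separate UI or database.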
Thank you
https://redd.it/yp693i
@r_devops
Anyone an expert in APM (Application Portfolio Management)??
Hi, I need to build an Excel file of all our business and tech applications, APM style, with details. Has anyone done this before and have a template of sorts? Thanks.
https://redd.it/yp7it3
@r_devops
Datadog Cost Optimization Tips
Hi folks! This sub provided inspiration for my company to add Datadog as an integration to our product so this is my attempt to return the favor.
This is a list of Datadog cost optimizations we have put into practice with customers and generally a collection of tips that experienced SREs seemed to know about but that we could not find listed publicly anywhere. Hope you find it helpful and please comment if there are more we are missing: https://www.vantage.sh/blog/datadog-cost-optimization-tips
https://redd.it/yp2xz4
@r_devops
What are some of the best ways to handle someone’s ego at work?
Serious question…
There must be an appropriate way to handle people’s egos without causing yourself problems…
I’m a Senior Staff level SRE, I’m fairly introverted, and I’m new to this DevOps (infrastructure) team at a very tiny startup company that’s trying to grow to enterprise level. I come from a history of working at much larger enterprises and doing legitimate SWE and actual product SRE/DevOps, and a situation has come up at work where one of the “rockstars” on the team has “corrected” me on something that they really don’t know anything about… and now their inaccuracy is going to cause misinformation throughout the org as well as some production issues that aren’t exactly trivial... But if I correct them, I fear that I may piss them off, may actually make myself out to be an asshole, and put a target on my back. This feels like a catch-22.
Moreover, I’ve already learned that this engineering org is toxic, and management is even more toxic (by far). I want to avoid leaving; I’m actually fairly stimulated by the challenge of surviving (and even possibly thriving) in a bad culture.
How have you been successful in dealing with a person’s ego in a situation where you feel compelled to speak up, especially when that person holds status within the org?
TL;DR: Just need the question in the title answered.
https://redd.it/yp8xbb
@r_devops
Is NixOS a thing?
I've been falling down a NixOS rabbit hole for a few days now. I want to ask whether it's any good for production and deployment.
Right now I'm using Puppet to manage all my servers, but the NixOS approach looks magnificent to me.
Does NixOS have tools like Hiera for managing multiple machines from the same repo and including manifests as packages?
Is NixOps mature enough today?
https://redd.it/ypb3pg
@r_devops
Governance Azure Policy to set WAF IP restrictions
I'm attempting to stop deployments of app services if they do not have the proper WAF custom rules for our IP restrictions on the FD that they are pushed through. I started writing some PowerShell for this, but Azure Policy would be best. If not Azure Policy, I would like to mimic policy behavior as much as possible. I was initially told I couldn't do this with policy because the solution I'm trying would need two major resources to understand each other's logic.
Is the only way to go about this to maybe delete the app service rather than block the deployment? That seems overboard and not fair to the app service devs. How often can this run? Can it be triggered by app service deployments? Can it be applied to just a single subscription? It would be great if it could enforce this automatically.
https://redd.it/ypaqkw
@r_devops
Should I run if networking is created by hand in a Terraform-backed project?
So, I am on a project which has Terraform in the stack, but we don’t have permissions for various things in the VPC category, which means Terraform cannot deploy our network fully.
Should I run from the project? What are your thoughts?
https://redd.it/yp0vtm
@r_devops
Is it alright to build an app on the same VPS that it is running on?
Is it alright to build an app on the same VPS that it is running on?
https://redd.it/ypes9u
@r_devops
What are some of the most unconventional job titles for devops/cloud engineer that you have come across?
I'll go first.
Recently I saw a LinkedIn post from someone who had their title set as 'Chief Devops Wizard'.
https://redd.it/ypgiga
@r_devops
New to DevOps
I have been a full stack developer for about 5 years now and recently moved to a new company. I knew that they didn't have a DevOps team upon interviewing with them but I didn't realize how bad it was. Since I had experience with some DevOps principles at my last job, I had some suggestions as to what could be changed. This led them to ask me to be their DevOps engineer as well (since they didn't have budget to hire one). I was happy to do this because I find DevOps very interesting and look forward to learning more.
That being said, I have no idea where to begin. I have begun to add insight to their code with logs and tracing but I don't feel like that is really DevOps, it's just necessary.
Things aren't containerized, deployment is very manual, IaC is nonexistent, and there are lots of other gaps.
My question is, where do I start? What is a good base so that I can begin to bring things into the modern era, that is also easy enough for someone with little DevOps experience?
Note: We do use AWS but not to its fullest extent. Also, getting some consultant time is a hard sell.
Any advice would be very appreciated!
https://redd.it/yoy9wo
@r_devops
GCP Associate Cloud Engineer
How long would it take someone to prepare for this exam?
I have work experience with AWS (plus the Cloud Practitioner and Solutions Architect Associate certs).
Is it very different from AWS, or is it more just the naming of the services?
https://redd.it/ypirmp
@r_devops
DevOps best practices - Staging environments
Hi,
I am new to DevOps and learning about the different staging environments.
I find it hard to find a single authoritative source that I can read on the best practices and which is the best approach to take.
My knowledge comes from anecdotes and talking with colleagues.
What I have so far is:
Dev/Non-Prod/Production environments
Blue/Green Deployment
Which type of process should be applied, and how do you technically implement these different environments? Do you have a single repo, and a branch for each environment?
Some further light on this would be great!
https://redd.it/yk8j36
@r_devops
Distributed Tracing in 2025: What the future holds
Has Distributed Tracing arrived? All you need to know about the current state and future of Observability
https://keyval.dev/distributed-tracing-2025/
https://redd.it/yplenl
@r_devops
CICD strategy with UAT
Hi guys,
Usual approach:
We usually use a default or slightly modified Git branching strategy with feature, dev, and master branches.
We create feature branches from dev and merge them back into dev. After some time a code freeze is declared: dev is "locked", tested by QA, and then pushed into master. Master is considered prod-ready, and packages built from it are shared with clients.
Current project approach:
On another project that I joined, my client provides a website to his own clients. Clients upload data that is transformed and prepared to be consumed as files and reports. Their logic is mainly separate, but there are some common parts, so some parts may interfere with each other(!)
Their current workflow also uses feature-dev-master branches, BUT they have different environments.
They use the dev branch to publish to the dev env and, after dev testing, to QA for proper QA testing.
After that is done, the branch goes into master. This master branch is published to the UAT environment, and after confirmation from the client, master goes to the Prod env as well.
https://ibb.co/1nMR50w
Problem:
The problem here is that everything in master should be marked as "ready for production", which means every client has to check their story and give their approval.
We are also no longer in the development phase but in the support phase, which means no planned releases, mainly small changes and bug fixes.
So my team is facing the following issue: we have a couple of features/bug fixes implemented and ready to be delivered after UAT testing. Suddenly another client comes to us with a critical data issue that we need to fix. We fix it, but we cannot push it to prod because there are 2 changes still waiting for UAT approval.
A quick solution here would be a cherry-pick. But this is quite a typical scenario, so we would have to cherry-pick every time. Moreover, because this critical fix was tested on UAT together with the other 2 features, we cannot guarantee (99.99% sure, but not 100%) that the same correct behaviour remains after we push it to production without them. Ideally we would need to test it again, which doesn't make a lot of sense.
I came up with a new flow. It works better in that we will have a branch containing only those changes that will go to production. But it doesn't completely mitigate the cherry-pick issue, and I'm not sure whether there is anything else we can improve.
https://ibb.co/Lddpq4m
https://ibb.co/8D3Dv2s
https://redd.it/ypkzmz
@r_devops
Automation API-like feature for Terraform CDK?
Is there a way to embed Terraform CDK code in a clean way like we can do with Pulumi's Automation API?
https://redd.it/ypnu10
@r_devops