Reddit DevOps

What can we do better?

I work at a small startup and we offer a system on the web. We currently have 500 subscribers.

Our most pressing issue is that we don't always deliver updates with actual quality and end up hurting our clients in the process.

A few of our latest issues have been:

A guy dropped an index on our database and thought his query to create another index to replace it had run successfully, but it hadn't. Our database was overloaded for about an hour, until he realized his mistake.

A developer was using an ORM to generate queries but one of the generated queries ended up using the wrong date field, which had no index. Many clients reported the system was slow as a result of that.

A front-end developer fixed a bug but ended up bringing back another bug, which had already been fixed before.

I updated our Redis cluster (which changed its hostname) and forgot that there was a pretty important Lambda function which used that hostname. Only found out a week later.

Our main system is in Java, with a mix of Spring and Struts. It's also pretty monolithic. We're currently in the process of migrating most of our SSR pages to Angular and we're also making our back-end available as a public API through AWS API Gateway.

All our developers get full dumps of our production database whenever they need it. The downside of that (security aside) is that they have to wait for hours for their local database to be ready. The back-end developers can also connect directly to our production database (running on AWS RDS) when they need to debug.

Our back-end has a few tests, but they're all very basic and they were only introduced because someone said "hey, we need tests". No new tests have been added since September, even though we've made a lot of updates since then.

Our front-end has zero tests.

We create a different Git branch for every new issue. After the developer finishes their work, they send it to the staging girl to test it. The staging girl manually checks if everything is working as intended. A big issue with that is that she cannot catch any bugs that demand a lot of traffic to reproduce, which end up being the most serious bugs, since they affect everyone.

After staging is done, the developer opens a pull request. Another guy reviews the PR and approves it. The code will, then, go through a pipeline which builds it using Maven and uploads it to Elastic Beanstalk. When traffic isn't too high, we start the updates manually, selecting the version we want and rolling back if there was any issue.

The infrastructure for our main system was created manually, through the AWS console. I've been using IaC (AWS CDK) for new micro-services and when I need to move a service to a different kind of infrastructure. There is no pipeline for infrastructure; updates are performed manually.

Whenever there's a performance/stability issue, I use CloudWatch metrics and logs, as well as VisualVM, to diagnose it. One problem we have is that we don't have a history of our JVM metrics. If we don't happen to be at the office at the time of the issue, we have no way of telling what went wrong until the problem resurfaces.

https://redd.it/10inb1i
@r_devops

What can we do better?

I work at a small startup and we offer a system on the web. We currently have 500 subscribers. Our most pressing issue is that we don't always...

7 views16:52