Reddit DevOps
I want to scale up/down as fast as AWS Lambda but also be able to allocate vCPUs (minimum 8) per task, any advice?

# Goal

So I'm looking for an AWS product or combination of products to accomplish the following:

* Ability to do CPU-heavy (non-parallel) calculations (min 8 vCPUs per node)
* Minimum timeout limit of 1800 seconds (AWS Lambda caps out at 900 seconds)
* Ability to automatically scale down to 0/1 and back up
* Ability to scale up really fast (<30 seconds)
* Event-driven execution model (one task per node; the node gets destroyed after the job is finished)
* Assign the desired vCPUs per event/task/job "type" (3 predefined types)

I'll try to be as detailed as possible in describing what the job is, the two setups I've tried, and what I don't like about them.

---

## Definitions

* ROE: Route Optimisation Engine
* VRP: Vehicle Routing Problem. A single JSON containing all the information regarding the stops, vehicles, time windows and start/end addresses, which needs to be optimised. VRPs can be classified into three complexities \[easy, medium, hard\] based on:
  * number of stops
  * restrictions per stop (time window, capacity)
  * number of vehicles
  * restrictions per vehicle (time window, maximum range, capacity, breaks)
* Easy VRPs need fewer CPU resources and can be solved more quickly than, for example, hard VRPs.
* Solution: the best solution found for the VRP (the most efficient routes for the VRP)

## Setup 1 - AWS Lambda

## Diagram

[https://preview.redd.it/iqabr1pt2tu41.png?width=743&format=png&auto=webp&s=fe05f8015a8d72ef71c016c9aa200023a390cf6c](https://preview.redd.it/iqabr1pt2tu41.png?width=743&format=png&auto=webp&s=fe05f8015a8d72ef71c016c9aa200023a390cf6c)

## AWS SQS

A standard AWS SQS queue for optimisation request messages. Every message that arrives triggers the ROE (AWS Lambda).

## AWS Lambda

The Lambda (3008 MB memory) is triggered by the AWS SQS queue for optimisation request messages and processes messages as they are added to the queue. The maximum timeout of any AWS Lambda function is 15 minutes.

## Problems

* AWS Lambda does not let us allocate more CPU resources (CPU scales only with memory), so optimising medium-complexity VRPs takes a long time.
* AWS Lambda's maximum timeout of 15 minutes makes using a Lambda to optimise high-complexity VRPs impossible.

## Setup 2 - AWS Batch

## Diagram

[https://preview.redd.it/twkxguw03tu41.png?width=808&format=png&auto=webp&s=34f951730078dc9fc25c345a3a5584ce8980992d](https://preview.redd.it/twkxguw03tu41.png?width=808&format=png&auto=webp&s=34f951730078dc9fc25c345a3a5584ce8980992d)

## Steps

* WA sends optimisation request to API
* API creates a VRP and stores the VRP in S3
* API evaluates the complexity of the VRP (low, medium or high)
* API creates an AWS Batch Job (with the parameter problemId=123456789)
* API allocates the correct number of vCPUs and memory (predefined) to that job based on the VRP complexity
* API adds the AWS Batch Job to the correct AWS Batch queue (the queue is determined by the VRP complexity)
* API returns a 200 OK response to the WA
* AWS Batch Job is taken from the AWS Batch Queue for execution
* AWS Batch Job fetches the VRP from S3 based on the problemId it received during step 4
* AWS Batch Job solves the VRP
* AWS Batch Job stores the solution in S3
* WA requests the solution from the API
* API gets the solution from S3 and returns it to WA
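Steps 4–6 above can be sketched in Python with boto3. The queue names, job definition name, and the complexity-to-resource presets below are illustrative assumptions, not values from the post:

```python
# Hypothetical per-complexity resource presets (illustrative values only).
PRESETS = {
    "low":    {"vcpus": 8,  "memory": 16384},
    "medium": {"vcpus": 16, "memory": 32768},
    "high":   {"vcpus": 32, "memory": 65536},
}

def build_submit_args(problem_id, complexity):
    """Build the kwargs for batch.submit_job() for a VRP of a given complexity."""
    preset = PRESETS[complexity]
    return {
        "jobName": f"roe-{problem_id}",
        "jobQueue": f"roe-{complexity}",        # one queue per complexity class
        "jobDefinition": "roe-job-definition",  # hypothetical definition name
        "parameters": {"problemId": str(problem_id)},
        "containerOverrides": {
            "vcpus": preset["vcpus"],
            "memory": preset["memory"],         # MiB
        },
    }

# Actually submitting requires boto3 and AWS credentials:
#   import boto3
#   boto3.client("batch").submit_job(**build_submit_args(123456789, "medium"))
```

Note that `containerOverrides` can only request what the compute environment can supply; the slow 0→1 scale-up complained about below comes from EC2 instance provisioning, not from the submission call itself.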

## Problems

* Scaling up takes unacceptably long
* Going from 0 to 1 node takes too long (300+ seconds)
* Going from 1 to X nodes takes too long (60–300+ seconds)

https://redd.it/g7gs31
@r_devops
Using AWS SDK from Fargate

I'm new to devops, new to Terraform, new to AWS and new to the cloud. So please be nice to me! :)

With Terraform I have provisioned an ECS environment running a Fargate container that executes a Go binary on AWS.

In my go app I am using the AWS SDK as a library to fetch values from SSM parameter store. The app fails because no credentials are provided to make the API call.

From what I read, it is the task execution role that determines the AWS credentials and access in this environment. So if the task execution role has permission to get parameters from SSM, my app should be able to use the SDK to do the same. I have added the corresponding policy to the role, but it does not work. And let me say, I did this under the strong supervision of random people on the internet, open-source code and Medium articles, so I do not fully understand what I was actually doing. So let me show you.

```
# ECS task execution role data
data "aws_iam_policy_document" "ecs_task_execution_role" {
  version = "2012-10-17"

  statement {
    sid     = ""
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# ECS task execution role
resource "aws_iam_role" "ecs_task_execution_role" {
  name               = var.ecs_task_execution_role_name
  assume_role_policy = data.aws_iam_policy_document.ecs_task_execution_role.json
}

variable "iam_policy_arn" {
  default = [
    "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy",
    "arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess" # the relevant policy
  ]
}

# ECS task execution role policy attachment
resource "aws_iam_role_policy_attachment" "ecs_task_execution_role" {
  count      = length(var.iam_policy_arn)
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = var.iam_policy_arn[count.index]
}
```
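One distinction worth noting as a possible explanation (a sketch, not a confirmed diagnosis): in ECS, the *execution* role is what the ECS agent itself uses to pull the image and inject the `secrets` block, which is why the entrypoint test works, while SDK calls made *inside* the container use the separate *task* role. A hypothetical addition, reusing the assume-role policy document above:

```hcl
# Hypothetical task role, distinct from the execution role above.
resource "aws_iam_role" "ecs_task_role" {
  name               = "myapp-task-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_task_execution_role.json
}

resource "aws_iam_role_policy_attachment" "ecs_task_role_ssm" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess"
}

resource "aws_ecs_task_definition" "myapp" {
  # ... existing settings ...
  execution_role_arn = aws_iam_role.ecs_task_execution_role.arn
  task_role_arn      = aws_iam_role.ecs_task_role.arn # what the in-app SDK uses
}
```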

For a test case I was even fetching a value from SSM in my ECS task definition template and setting it as an environment variable. It looked like this:


```
"secrets": [
  {
    "name": "POSTGRES_PASSWORD",
    "valueFrom": "${database_password_arn}"
  }
]
```

Only to later print it out to the shell in my *docker-entrypoint.sh*, which it did with the correct value.

On the next line I execute the Go binary, which fails in the following code that tries to do the very same thing, but inside my application:


```
// imports used by this snippet
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ssm"
)

sess, err := session.NewSessionWithOptions(session.Options{
	SharedConfigState: session.SharedConfigEnable,
})
if err != nil {
	panic(err)
}

ssmsvc := ssm.New(sess, aws.NewConfig().WithRegion("eu-central-1"))
keyname := "/myapp-secret/database/password/master"
withDecryption := true
param, err := ssmsvc.GetParameter(&ssm.GetParameterInput{
	Name:           &keyname,
	WithDecryption: &withDecryption,
})

if err != nil {
	panic(err) // no credentials provided
}
```

I am hoping for support 🙈 Thank you in advance. I'm sure there are also some misconceptions on my side, so don't hesitate to correct me.

https://redd.it/g7fsoz
@r_devops
Ubuntu 20.04 may be causing issues in docker builds with tzdata prompt

Today a routine docker build in our CI got stuck.

```
Setting up tzdata (2019c-3ubuntu1) ...
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
Configuring tzdata
------------------

Please select the geographic area in which you live. Subsequent configuration
questions will narrow this down by presenting a list of cities, representing
the time zones in which they are located.

1. Africa 4. Australia 7. Atlantic 10. Pacific 13. Etc
2. America 5. Arctic 8. Eur
Job Finished

```

The base image we used is ubuntu:latest. But Ubuntu released 20.04 Focal Fossa just yesterday, and one of the changes in it broke our builds.

Here are 2 options to work around it:

#1 Change to ubuntu:18.04

Just change your base image down to an older stable version.

```
-FROM ubuntu:latest
+FROM ubuntu:18.04
```

#2 DEBIAN_FRONTEND=noninteractive apt-get…

```
 RUN apt-get update && \
-    apt-get install -y \
+    DEBIAN_FRONTEND=noninteractive TZ=Asia/Singapore \
+    apt-get install -y \
```

I don’t go for ENV DEBIAN_FRONTEND=noninteractive because that’s not an env var I wanna set permanently for the container image.
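A third option worth mentioning (a common pattern, not from the original post): `ARG` variables only exist during the build, so the frontend setting never persists into the running container the way `ENV` would:

```dockerfile
FROM ubuntu:20.04

# ARG is build-time only, so this does not leak into the final
# image's environment the way ENV would.
ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && \
    apt-get install -y tzdata
```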

Anyway that’s just a quick fix, I hope they patch it soon!

[Original source](https://anonoz.github.io/tech/2020/04/24/docker-build-stuck-tzdata.html)

https://redd.it/g781rs
@r_devops
BI / Analytics / Reporting in a modern microservice stack

Hi All

I'm a PM at an AgriTech BPM startup.

We're in the process of planning out the next version of our tech stack. We'll be breaking down our current roughly 5-year-old PHP/jQuery/OrientDB monolith into (most likely) Node.js/Vue.js/Couchbase microservices.

We currently use 2 separate components for reporting: hand-built frontend tables, graphs and the like for the basic "inline" stuff like KPIs and basic aggregated tables and charts, and then M$ Power BI for the more advanced dashboard-like stuff, which we then embed. The latter is fed by ETL jobs that push data on a schedule into an Azure SQL DB. The licensing thereof is also damn near crippling, TBH.

So with all that in mind, I'd very much like to hear your opinions on how/what/where to incorporate a reporting and BI solution. I'm basically trying to answer the following question:

Is it realistic to want to build one microservice that does everything from quick basic KPIs all the way up to complex dashboards and embedded reports?

* We're thinking along the lines of a Postgres instance with Cube.js and some form of Vue.js & D3.js frontend that can be embedded, or even something like Plotly's Dash (Python stack). Ideally we'd like to be able to eventually hire a data analyst or 3 who can run with this with minimal input from devs.
* Is there merit in splitting the two ends of the spectrum into basically what we have now: one component that sits inline on the frontend, pulls from the production DB, and only ever remains a mini-framework for basics like the aforementioned KPIs, while treating the proper embedded reports and dashboards as a separate thing? Also bear in mind the possible AI/ML components that may come along in future.

P.S. I'm trying to stay away from the Docker & Kubernetes conversation on this specific point, but if it's relevant please feel free.

I've been accused of rambling on Reddit before, so please forgive if that's the case here.

Looking forward to your opinions.

https://redd.it/g77y88
@r_devops
I'm doing a survey in the devops community: What challenges do you face implementing continuous security into your workflows?

I'm hoping to do a talk on the several challenges that make DevSecOps difficult. What do your security people struggle with compared to the old days of "pentest at the end"? With continuous delivery, pipelines get more complicated and attack vectors are no longer limited to the applications themselves, so as a community we have to adapt. For example, we struggle a lot with the cloud infrastructure side: most of the staff working on our Terraform builds and cloud infrastructure pieces aren't actually infrastructure people, which leads to exposed services and lazy, "function-first" configurations. Any comments or opinions like this would be much appreciated!

Thank you.

https://redd.it/g76tf5
@r_devops
What are good log analysis tools (not using Java)

I'm looking to set up some remote log analytics, but Elasticsearch, the ELK stack, Graylog etc. are all Java-based and not suitable for a small VM with 2 GB RAM (Java has issues without enough memory!). Surely there are some lightweight solutions out there? Maybe something written in Go?

https://redd.it/g76exg
@r_devops
Any tips/tutorials/cheatsheets to build skills in core networking concepts (VPN, subnetting, proxy, NAT, SSH tunnels, port forwarding...)?

Working as a junior DevOps engineer, it is frustrating that I have weak knowledge of networking... I need some guidance. How did you guys learn networking?

https://redd.it/g87f2c
@r_devops
Embarrassing Question

So I’ve been a DevOps Engineer now for about a year and a half, Windows sysadmin and application analyst for a few years prior to this and I’m embarrassed to admit I type horrifically.

It was never a huge deal before but I’ve found that as I’ve moved more into scripting, IaC and even high level app coding my current typing method is just super slow, inaccurate and cumbersome.

It's not that I peck at each key, but I almost always need to look at the keyboard, and I'm often slow and still make errors.

I was wondering if anyone had any good tips or resources for learning to type quickly and without needing to look at the keyboard? I’ve of course just googled for typing tutorials and the like but I wanted to see if anyone had things they could say had helped them personally.

Thanks for the time all! Hope everyone is staying healthy and sane!

https://redd.it/g84mlk
@r_devops
Exposing internal services to developers

Hello!

I'm going to create three Kubernetes clusters for a small team of developers. I'm going to need a set of internal tools (e.g. Grafana dashboards). Would you propose to use SSO or VPN to expose internal services?

https://redd.it/g81for
@r_devops
Understand how Prometheus Monitoring works | Explaining Prometheus Architecture

Prometheus has become the mainstream monitoring tool of choice in the container and microservice world.

[**In this video**](https://youtu.be/h4Sl21AKiDg) I explain the following topics:

* **Why Prometheus is so important** in such an infrastructure, and some specific use cases
* **Where and why Prometheus is used**, with specific use cases
* How Prometheus works: what are targets and metrics?
* How Prometheus collects those metrics from its targets
* **Prometheus architecture, explained with simple diagrams**, going through the main components: Prometheus Server, Pushgateway, Alertmanager
* Configuring Prometheus: an example YAML configuration
* The **advantages** of Prometheus's pull system compared to alternative monitoring tools that use a push system
* Using Prometheus monitoring with **Docker 🐳 and Kubernetes**
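As a companion to the configuration point in the list above, a minimal `prometheus.yml` looks like the following (the `node-exporter` target address is a hypothetical example):

```yaml
global:
  scrape_interval: 15s   # how often Prometheus pulls metrics from targets

scrape_configs:
  - job_name: "prometheus"          # Prometheus scraping its own /metrics
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"                # hypothetical node_exporter target
    static_configs:
      - targets: ["node-exporter:9100"]
```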

A separate practical video for monitoring Kubernetes services with Prometheus will follow.

Thought I'd share it; it could be helpful for some of you 🙂 I'd also appreciate any feedback.

https://redd.it/g7v7ny
@r_devops
Need recommendation for a CD platform

Hi,

I'm looking to host everything on a $20 droplet on DigitalOcean, and I need a recommendation for the following setup.

1. Has an option to build container images (Docker) and save them to a local registry.
2. Deploys the said container images as a single-instance system on the same machine.
3. Routes HTTP/HTTPS traffic to the per-container domain (e.g. [myapp.mydomain.com](https://myapp.mydomain.com)) and requests Let's Encrypt certificates automatically.
4. Has a web GUI (optional).
5. Directly syncs with GitHub.
6. Is easy to install.

I tried Dokku and such and I didn't like the Duct-tapyness of it - I'm looking for something more enterprise-grade that has an on-premise option available.

https://redd.it/g80j53
@r_devops
Trying to understand Helm and multiple applications

Hi Everyone,

I'm working on architecting a structure for a complex application and need to understand if I'm approaching this incorrectly. I am new to K8s and Helm, so you've been warned. ;)

I have a collection of applications (Client apps) that consume services from a single application (Server app). Since I would rather not spin up a duplicate Server app for each Client app, I would like the charts to detect and use the existing services that are already there. From my understanding, I might be able to accomplish this with requirements.yaml, but I can't find any documentation to confirm it.
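For reference (chart names and repo URL below are hypothetical): in Helm 2, shared sub-charts are declared in requirements.yaml, and a `condition` flag lets each Client chart skip installing its own copy of the Server chart when one already exists. In Helm 3 the same block moves into Chart.yaml under `dependencies:`.

```yaml
# requirements.yaml (Helm 2); in Helm 3 this lives in Chart.yaml
# under "dependencies:". Names and repository URL are hypothetical.
dependencies:
  - name: server-app
    version: "1.2.3"
    repository: "https://charts.example.com"
    # Only install the bundled server when enabled; set
    # server-app.enabled=false to point at an existing release instead.
    condition: server-app.enabled
```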

To add to this, I plan on having multiple environments in the development cluster. I know I could use namespaces, but I'm hoping I can avoid it if it's not necessary.

PS: if Helm is the wrong approach, I'm not married to it either.

Thanks!

https://redd.it/g7v40h
@r_devops
DevOps vs. SRE — Which is better for your career?

What do you guys think? Which is a better job title?
[https://medium.com/devops-dudes/devops-vs-sre-which-is-better-for-your-career-5694b5719d88?source=friends\_link&sk=7fde8bc1092eb01bf57cd79ba666f0d9](https://medium.com/devops-dudes/devops-vs-sre-which-is-better-for-your-career-5694b5719d88?source=friends_link&sk=7fde8bc1092eb01bf57cd79ba666f0d9)

https://redd.it/g7zgde
@r_devops
kubeletctl is an open-source client for kubelet with an option to scan for vulnerable containers

What can it do:

* Run any kubelet API call
* Scan for nodes with opened kubelet API
* Scan for containers with RCE
* Run a command on all the available containers by kubelet at the same time
* Get service account tokens from all available containers by kubelet
* Nice printing 📷

Check it out: [https://github.com/cyberark/kubeletctl](https://github.com/cyberark/kubeletctl)

#kubernetes #kubelet #kubeletctl

https://redd.it/g7ssbm
@r_devops
Openshift pipeline help

I need some good reading or video resources to understand how to design and implement an entire pipeline on OpenShift or Kubernetes. I understand Kubernetes and OpenShift from an infrastructure standpoint, but I need to learn how to take a traditional on-prem application and convert it to a DevOps pipeline on OpenShift or Kubernetes, with the entire build-test-deploy flow (all levels of testing). I have always been an infrastructure guy and never worked as a software developer. Thank you in advance.

https://redd.it/g7sglh
@r_devops
Seeking guidance

Hello everyone,

I have some thoughts about which programming language I should learn (Python || Ruby), and I want to share them with you and get some advice.

I have been working with Ansible to provision infrastructure for a long time already, using Molecule with Testinfra (Python) to test playbooks. Then an issue put me in the situation of migrating to Chef and all its tools (InSpec, RSpec, Serverspec, KitchenCI, etc.), which I don't regret at all; I actually kind of love it, because it gave me test-driven provisioning. But all of that is based on Ruby.
Now every time I get an interview for a DevOps position, the requirements always include Python and Bash for scripting, which is OK, but what if I can do the same scripting in Ruby?

Thanks in advance.

https://redd.it/g7s2ps
@r_devops
Migration from Docker Swarm to Kubernetes with same IP?

HI All,

I am working on migrating Docker Swarm based microservices to Kubernetes using Helm 3 charts. The migration job/script handles importing the current config files, volumes, etc. All the services now come up and we are able to validate them. I am going to use the MetalLB load balancer for the services.

Any suggestions on how to switch the IP over from the Docker Swarm VM to Kubernetes/MetalLB without downtime?

Note that both systems have a single entry point for the microservices (an API gateway).
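For reference, MetalLB's layer-2 mode (its ConfigMap-based configuration, current as of this post) can pin a pool to a specific address, so the gateway Service can take over the exact IP the Swarm VM held; the addresses below are hypothetical:

```yaml
# Minimal MetalLB layer-2 config (addresses are hypothetical): pinning
# the pool to the Swarm VM's current IP lets the api-gateway Service
# keep the same address after cutover.
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
      - name: default
        protocol: layer2
        addresses:
          - 192.168.1.50/32   # the IP previously held by the Swarm VM
```

The gateway Service can then request that exact address via `spec.loadBalancerIP`, which MetalLB honours when the IP is in a configured pool.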

https://redd.it/g7pnq4
@r_devops
Lacking some devops basics

So I've been working as a DevOps engineer for the past year. I had no prior knowledge of what DevOps was coming out of college. But I made do, getting hands-on experience with AWS, Kubernetes, etc. Still, I feel like I'm missing some basic knowledge that I should have. I know how to fix certain issues, debug my way out of random requirements, and use AWS services, all by googling on the job. But I don't know how to start learning and understanding why certain things work the way they do: topics like SNI, or why some TCP traffic needs Layer 3/4 handling; mostly networking-, cert- and proxy-related things. Not that I can't google all this myself, but can anyone point me in the right direction, or suggest any books that helped them really understand these abstract topics?

https://redd.it/g7pr1n
@r_devops
How come Amazon deploys 23,000 times a day? What are they changing so often?

OK, so I'm new to DevOps. I came across this image [https://imgur.com/a/3uBZKBN](https://imgur.com/a/3uBZKBN) and I was wondering what exactly Amazon (and other companies) change in all these deployments. Because I see pretty much the same website every day.

https://redd.it/g8ktuu
@r_devops
Praise dependabot! The github bot to manage your code's vulnerabilities

I just joined a new project in an automation engineer role, to help stretch the limited resources this team has. The first order of business was moving out of their private GitLab box, which wasn't enforcing HTTPS, to a GitHub org, so we can be a little more confident in the confidentiality of our source code.

I enabled dependency alerts on the new private repo, and now there's this trusty bot named Dependabot scanning and submitting PRs to update dependencies, clearing all sorts of CVEs that have been posted against the tools in use. I've never seen this feature before, so I figured I'd inform the masses of this neat feature.
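Beyond the automatic security-update PRs described above (which need no configuration), GitHub-native Dependabot can also open scheduled version-update PRs via a checked-in config file; the ecosystem and schedule below are examples to adapt to the repo's package manager:

```yaml
# .github/dependabot.yml — enables scheduled version-update PRs in
# addition to the automatic security PRs. Ecosystem, directory and
# interval here are example values.
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
```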

https://redd.it/g8ncd9
@r_devops