TL;DR Separating environments (dev, stg, prd) into one cloud-provider account each (AWS, Azure, GCP, etc.) is the preferred way to keep downtime minimal, ideally zero, when deploying to production. This includes separating dev from stg, even though it’s common to manage dev and stg in the same account.
The cover image implies that a single account will take part in the comparison; I’m leaving it out of scope since there are too many reasons to avoid such a setup. The image also implies that it’s terrifying.
In this blog post, I’ll share the difficulties that I faced during deployments to production when using two accounts: dev+stg and prd. These difficulties could’ve been avoided by separating all environments into different accounts: dev, stg, and prd.
Without further ado: dev+stg and prd vs. dev, stg, and prd.
Let’s assume each environment includes a VPC and a single Virtual Machine (EC2 instance). The Virtual Machine is deployed in a public subnet and has a public static IP (EIP - Elastic IP).
The default quota of EIPs is 5 per region. Our application grew larger, and suddenly we hit the EIP limit. In the dev+stg account, we go to the Requesting a quota increase page and raise the limit from 5 to 20 (“just in case”). It can take minutes to hours for the quota to update; usually, it’s a matter of minutes (do you feel lucky?).
It’s time to deploy from dev to stg, and since we know what we’re doing, we use Infrastructure as Code (IaC) to manage our environment’s resources; specifically, we use Terraform.
We run a CI/CD pipeline and plan from dev to stg; all goes well, so we apply the changes in stg and … it works as expected, because stg lives in the same account as dev and shares the quota we just raised. Now it’s time to plan from stg to prd, and if the plan is OK, then applying the changes to prd is totally safe, right? Wrong. Terraform does not take service quotas into account, so even if the plan has passed, applying the changes in prd will result in a failed deployment to your prd environment. It’s because “we” forgot to raise the service quota in prd, so now we have to log in to the prd account, ask for a quota increase, pray that it won’t take long, and re-deploy to prd.
Rewind, this time with separate dev, stg, and prd accounts. The application grew larger, and we hit the EIP limit. In the dev account, we request a quota increase, and so on. Now it’s time to plan from dev to stg; it goes well, so we apply the changes in stg, and boom 💥, we get an error. Luckily, it happened in stg and not in prd, so now we know that we must request a quota increase before deploying to prd. We even got proof of that in the logs of the CI/CD service.
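Since a passing plan says nothing about quotas, one way to catch this even earlier is a small pre-flight check in the pipeline. Below is a minimal sketch, not a definitive implementation. It assumes the plan was exported with `terraform show -json`, and that the current EIP usage and the account's EIP quota are looked up elsewhere and passed in as numbers. The function names (`planned_eip_creations`, `check_eip_quota`) and the sample plan are made up for illustration.

```python
def planned_eip_creations(plan: dict) -> int:
    """Count the aws_eip resources that a Terraform JSON plan would create.

    Any change whose actions include "create" is counted, including
    replacements. That is deliberately conservative: it may overestimate,
    which is the safe direction for a quota check.
    """
    count = 0
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if change.get("type") == "aws_eip" and "create" in actions:
            count += 1
    return count


def check_eip_quota(plan: dict, eips_in_use: int, eip_quota: int) -> None:
    """Raise if applying the plan could exceed the account's EIP quota.

    eips_in_use and eip_quota are assumed to be fetched by the caller
    from the target account before this check runs.
    """
    needed = eips_in_use + planned_eip_creations(plan)
    if needed > eip_quota:
        raise RuntimeError(
            f"Plan needs up to {needed} EIPs but the quota is {eip_quota}; "
            "request a quota increase before applying."
        )


# Hypothetical, trimmed-down plan in the shape of `terraform show -json`.
example_plan = {
    "resource_changes": [
        {"type": "aws_eip", "change": {"actions": ["create"]}},
        {"type": "aws_eip", "change": {"actions": ["create"]}},
        {"type": "aws_eip", "change": {"actions": ["no-op"]}},
        {"type": "aws_instance", "change": {"actions": ["create"]}},
    ]
}

if __name__ == "__main__":
    # 3 EIPs in use + 2 planned = 5, which still fits the default quota of 5.
    check_eip_quota(example_plan, eips_in_use=3, eip_quota=5)
    print("EIP quota check passed")
```

Run against the prd account before the apply step, a check like this would turn the failed prd deployment into a failed pre-flight step. It only covers the one quota you thought of, though, which is why a separate stg account is still the better safety net.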
Let’s talk about Elastic Container Registry (ECR); usually, I store all of the Docker images in the same account, and allow access from all other accounts. Some companies prefer to store the images in dev-account (or dev+stg-account), some in prd-account; either way, it requires sharing the images between accounts.
The images are differentiated by tags, like app:1.0.1.2383-dev and app:1.0.1.2383-prd. You can see here that I omitted stg because prd images are deployed to stg before they are deployed to prd; we don’t build something new for prd. This is not mandatory, and some applications might require a specific build for prd, though it’s not recommended.
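To make the tagging rule concrete, here is a minimal sketch of how a pipeline could pick the image tag for each environment. The `image_tag` function and its parameters are hypothetical; only the tag format and the "stg runs the prd image" rule come from the convention described above.

```python
def image_tag(app: str, version: str, env: str) -> str:
    """Return the image reference to deploy for a given environment.

    dev gets its own build; stg and prd share the same "-prd" build,
    so what is tested in stg is exactly what ships to prd.
    """
    if env not in ("dev", "stg", "prd"):
        raise ValueError(f"unknown environment: {env}")
    build_flavor = "dev" if env == "dev" else "prd"
    return f"{app}:{version}-{build_flavor}"


if __name__ == "__main__":
    for env in ("dev", "stg", "prd"):
        print(env, "->", image_tag("app", "1.0.1.2383", env))
```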
Working on dev+stg is great because there’s no overhead of dealing with repository policies: both environments are in the same account, so you set access once for both of them. Deploying containers to stg works as expected with no surprises. Production time! You guessed it; it fails because the prd account cannot access the images unless you add a repository policy that allows it.
I’m proud of you. You guessed it right again; with separate accounts, the same issue will happen as soon as we deploy to stg, forcing us to add the relevant repository policy to allow access from stg and from prd.
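For reference, a repository policy that lets other accounts pull images can be sketched as follows. This is a minimal example under a few assumptions: the account IDs are placeholders, the function name `cross_account_pull_policy` is made up, and only the three ECR actions needed for pulling are granted. Granting access to an account's `root` principal delegates to that account's own IAM policies, which may be broader than you want.

```python
import json

# ECR actions required to pull an image.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def cross_account_pull_policy(account_ids: list[str]) -> str:
    """Build an ECR repository policy (as JSON text) that allows
    the given AWS accounts to pull images from the repository."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountPull",
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{acct}:root" for acct in account_ids]
                },
                "Action": PULL_ACTIONS,
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Placeholder account IDs standing in for the stg and prd accounts.
    print(cross_account_pull_policy(["111111111111", "222222222222"]))
```

The resulting JSON could then be attached to the repository, for example through Terraform's `aws_ecr_repository_policy` resource, so that the policy is part of the same IaC that the stg deployment exercises.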
Forgetting to request a quota increase in prd is something that can happen to anyone, so why take the chance?
“I am a human being. I was designed to make mistakes. The DevOps team’s job is to design a system that mitigates the risk from those mistakes as much as possible (being honest here, no system is idiot-proof).” (by me)
Deploying to a stg account that is separate from dev provides the best conditions for catching unforeseen failures before deploying to prd, which is exactly what we aim for: no surprises in production.
It’s important to take into account the costs of separating environments into different accounts. For example, if you’re using AWS Web Application Firewall (WAF), it’s possible to attach the firewall to resources of both dev and stg. Separating dev and stg into different accounts means you’ll need to create the WAF resource in both accounts, hence pay for it “twice”.
It all comes down to the question of what will cost more: potentially paying more for resources that could have been shared between dev and stg, or having unpredictable deployments to prd that might result in unwanted downtime. Once you answer this question, you’ll know if you’re willing to separate stg from dev.
If you got this far, it means you’re really into it, so if you want to take it a step further, it’s best to create another account for “operations” (Ops); some might call it “management” (Mgmt).
Some companies use their dev-account as the ops-account. It’s not that bad, though it somewhat misses the point: services that impact all environments and accounts end up living in the dev-account.
Here are the main reasons for having a separate ops-account:
And if you got this far, then I guess you learned something new today :) The examples provided in this blog post are merely the tip of the iceberg. There are many more cases where separating environments into different accounts reduces the number of white hairs you’re growing each day; I’m sure it was scientifically proven somewhere.