TL;DR Giving each environment (dev, stg, prd) its own cloud-provider account (AWS, Azure, GCP, etc.) is the preferred way to keep downtime minimal, ideally zero, when deploying to production. This includes separating dev from stg, even though it’s common to manage dev and stg in the same account.
The cover image implies that a single account will be part of the comparison; I’m leaving it out of scope since there are too many reasons to avoid such a thing. The image also implies that it’s terrifying.
In this blog post, I’ll share the difficulties that I faced during deployments to production when using two accounts, dev+stg and prd. These difficulties could’ve been avoided by separating all environments into different accounts: dev, stg, and prd.
Without further ado,
dev+stg and prd vs. dev, stg, and prd
Let’s assume each environment includes a VPC and a single Virtual Machine (EC2 instance). The Virtual Machine is deployed in a public subnet and has a public static IP (EIP - Elastic IP).
The default quota of EIPs is 5 per region. Our application grew larger, and suddenly we hit the EIP limit. In the dev+stg account, we go to the Requesting a quota increase page and raise the limit from 5 to 20 (“just in case”). It can take minutes to hours for the quota to update; usually, it’s a matter of minutes (do you feel lucky?).
It’s time to deploy from stg, and since we know what we’re doing, we use Infrastructure as Code (IaC) to manage our environment’s resources; specifically, we use Terraform. We run a CI/CD pipeline and plan from stg; all goes well, so we apply the changes in stg and … it works as expected. Now it’s time to plan from prd, and if the plan is OK, then applying the changes to prd is totally safe, right? Wrong. Terraform does not take service quotas into account, so even though the plan passed, applying the changes in prd results in a failed deployment to your prd environment. That’s because “we” forgot to raise the service quota in prd, so now we have to log in to the prd account, ask for a quota increase, pray that it won’t take long, and re-deploy to prd.
Rewind, this time with a separate account per environment: the application grew larger, and we hit the limit of EIPs. In the dev account, we request a quota increase, and so on. Now it’s time to plan from stg; it goes well, so we apply the changes in stg, and boom 💥, we get an error. Luckily, it happened in stg and not in prd, so now we know that we must request a quota increase before deploying to prd. We even have proof of that in the logs of the CI/CD service.
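Since the Terraform plan won’t catch a quota problem, one option is a small pre-flight step in the pipeline that compares what each account is about to use against its quota. Here’s a minimal sketch of that idea. The `QuotaCheck` and `preflight` names are mine, and the numbers are made up and passed in by hand; in a real pipeline they would have to come from the cloud provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaCheck:
    """Planned EIP usage for one account (hypothetical helper, not a Terraform feature)."""
    environment: str
    quota: int              # current EIP quota in the account
    in_use: int             # EIPs already allocated
    planned_additions: int  # EIPs the next deployment will add

    @property
    def required(self) -> int:
        return self.in_use + self.planned_additions

    @property
    def ok(self) -> bool:
        return self.required <= self.quota


def preflight(checks: list[QuotaCheck]) -> list[str]:
    """Return one error message per environment that would exceed its quota."""
    return [
        f"{c.environment}: needs {c.required} EIPs but the quota is {c.quota}"
        for c in checks
        if not c.ok
    ]


# Made-up numbers: the dev+stg quota was raised to 20, prd was left at the default of 5.
checks = [
    QuotaCheck("dev+stg", quota=20, in_use=5, planned_additions=1),
    QuotaCheck("prd", quota=5, in_use=5, planned_additions=1),
]
errors = preflight(checks)
for line in errors:
    print(line)
```

A check like this only covers the quotas you remembered to list, which is exactly why a separate stg account is still the better safety net.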
Let’s talk about Elastic Container Registry (ECR); usually, I store all of the Docker images in the same account, and allow access from all other accounts. Some companies prefer to store the images in dev-account (or dev+stg-account), some in prd-account; either way, it requires sharing the images between accounts.
The images are differentiated by tags, like app:184.108.40.2063-prd. You can see here that I omitted stg; that’s because prd images are deployed to stg before they are deployed to prd, so we don’t build something new for prd. This is not mandatory, and some applications might require a specific build for prd, though it’s not recommended.
dev+stg is great because there’s no overhead of dealing with repository policies: both environments are in the same account, so you set the policy once for both of them. Deploying containers to stg works as expected with no surprises. Production time! You guessed it; it fails because the prd account cannot access the images unless you add a repository policy that allows it.
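To make the “repository policy” part concrete, here’s a rough sketch that builds the JSON for an ECR repository policy allowing other accounts to pull images. The function name and the account IDs are placeholders I made up; the three actions are the ones needed for pulling. The pulling side still needs its own IAM permissions (including `ecr:GetAuthorizationToken`), which this sketch does not cover.

```python
import json


def cross_account_pull_policy(account_ids: list[str]) -> dict:
    """Build an ECR repository policy that lets the given accounts pull images."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountPull",
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{account_id}:root" for account_id in account_ids]
                },
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:BatchGetImage",
                    "ecr:GetDownloadUrlForLayer",
                ],
            }
        ],
    }


# Placeholder account IDs, not real accounts.
STG_ACCOUNT_ID = "111111111111"
PRD_ACCOUNT_ID = "222222222222"

policy = cross_account_pull_policy([STG_ACCOUNT_ID, PRD_ACCOUNT_ID])
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

With separate accounts, stg already has to be in that list, so a missing policy shows up in stg long before it can break prd.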
I’m proud of you. You guessed it right again; the same issue will happen when we deploy to stg, forcing us to add the relevant repository policy to allow access from stg and from prd.
Forgetting to request a quota increase in prd is something that can happen to anyone, so why take the chance?
“I am a human being. I was designed to make mistakes. The DevOps team’s job is to design a system that mitigates the risk from those mistakes as much as possible (being honest here, no system is idiot-proof).” (by me)
A stg account that is separated from dev provides the optimal conditions for avoiding unforeseen exceptions when deploying to prd, which is exactly what we aim for: no surprises in production.
It’s important to take into account the costs of separating environments into different accounts. For example, if you’re using AWS Web Application Firewall (WAF), it’s possible to attach the firewall to resources of both dev and stg when they share an account. Separating dev and stg into different accounts means you’ll need to create the WAF resource in both accounts, and hence pay for it “twice”.
It all comes down to the question of what will cost more: potentially paying more for resources that could have been shared between dev and stg, or having unpredictable deployments to prd that might result in unwanted downtime. Once you answer this question, you’ll know if you’re willing to separate dev and stg.
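If you want to put rough numbers on that question, a back-of-the-envelope comparison could look like the sketch below. Every figure is a placeholder I invented for illustration (not real pricing or real incident data), and the function names are mine; plug in your own estimates.

```python
def monthly_cost_of_separation(duplicated_resources: dict[str, float]) -> float:
    """Extra monthly spend for resources that now exist once per account."""
    return sum(duplicated_resources.values())


def expected_monthly_downtime_cost(
    failed_deploys_per_month: float,
    minutes_per_failure: float,
    cost_per_minute: float,
) -> float:
    """Expected monthly cost of prd deployments that fail by surprise."""
    return failed_deploys_per_month * minutes_per_failure * cost_per_minute


# Placeholder estimates only.
separation = monthly_cost_of_separation({"waf": 100.0})
downtime = expected_monthly_downtime_cost(
    failed_deploys_per_month=0.5,  # one surprise every two months
    minutes_per_failure=30,
    cost_per_minute=20.0,
)
worth_separating = downtime > separation
print(f"separation: {separation}, downtime: {downtime}, separate? {worth_separating}")
```

The hard part is the estimates themselves, especially how often a deployment to prd surprises you, so treat the result as a conversation starter rather than a verdict.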
If you got this far, it means you’re really into it, so if you want to take it a step further, it’s best to create another account for “operations” (Ops); some might call it “management” (Mgmt).
Some companies use their dev-account as the ops-account. It’s not that bad, though keeping services that impact all environments and accounts in the dev-account somewhat misses the point.
Here are the main reasons for having a separate ops-account:
And if you got this far, then I guess you learned something new today :) The examples provided in this blog post are merely the tip of the iceberg. There are many more cases where separating environments into different accounts reduces the number of white hairs you’re growing each day; I’m sure it was scientifically proven somewhere.