The importance of separated environments

TL;DR Giving each environment (dev, stg, prd) its own cloud-provider account (AWS, Azure, GCP, etc.) is the preferred way to keep downtime at or near zero when deploying to production. This includes separating dev from stg, even though it's common to manage dev and stg in the same account.

Prologue

The cover image implies that a single account will take part in the comparison. I'm leaving it out of scope since there are too many reasons to avoid such a setup. The image also implies that it's terrifying.

In this blog post, I'll share the difficulties I faced during deployments to production when using two accounts: dev+stg and prd. These difficulties could have been avoided by separating all environments into different accounts: dev, stg, and prd.

Assumptions

  • For the sake of simplicity, from now on, I'll refer to the cloud provider as AWS, though the same points apply to other cloud providers.
  • All environments are deployed in the same region since we don't want to find out that some cloud service is available in Ireland but not in London.

Without further ado, dev+stg and prd vs. dev, stg, and prd

1 Service quota surprises

Let's assume each environment includes a VPC and a single Virtual Machine (EC2 instance). The Virtual Machine is deployed in a public subnet and has a public static IP (EIP - Elastic IP).

The default quota of EIPs is 5 per region. Our application grew larger, and suddenly we hit the EIP limit. In the dev+stg account, we go to the Requesting a quota increase page and raise the limit from 5 to 20 ("just in case"). It can take minutes to hours for the quota to update; usually, it's a matter of minutes (do you feel lucky?).

It's time to promote the change from dev to stg, and since we know what we're doing, we use Infrastructure as Code (IaC) to manage our environments' resources; specifically, we use Terraform.

dev+stg (same account)

We run a CI/CD pipeline and plan the promotion from dev to stg; all goes well, so we apply the changes in stg and ... it works as expected, because stg lives in the same account as dev and therefore already has the raised quota.
Now it's time to plan the promotion from stg to prd, and if the plan is OK, then applying the changes to prd is totally safe, right? Wrong. Terraform does not take service quotas into account, so even if the plan has passed, applying the changes in prd will result in a failed deployment to your prd environment. That's because "we" forgot to raise the service quota in prd, so now we have to log in to the prd account, ask for a quota increase, pray that it won't take long, and re-deploy to prd.

dev and stg (separate accounts)

Rewind: the application grew larger, and we hit the EIP limit. In the dev account, we request a quota increase, and so on. Now it's time to plan the promotion from dev to stg; it goes well, so we apply the changes in stg, and boom 💥, we get an error, because the quota was raised only in the dev account. Luckily, it happened in stg and not in prd, so now we know that we must request a quota increase before deploying to prd. We even have proof of that in the logs of the CI/CD service.
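Since the plan itself won't flag a quota problem, one option is a small pre-apply check in the pipeline. The following is only a minimal sketch, assuming the plan was exported to JSON (for example with `terraform show -json`) and that the current EIP usage and quota of the target account are fetched elsewhere and passed in. The function names and the sample plan are hypothetical.

```python
# Minimal sketch: compare the EIPs a Terraform plan wants to create
# with the quota headroom of the target account.
# `in_use` and `quota` are assumed to be looked up elsewhere.

def count_planned_creates(plan: dict, resource_type: str = "aws_eip") -> int:
    """Count resources of `resource_type` that the plan will create.

    Replacements also include a "create" action, so they are counted too,
    which makes the check conservative.
    """
    return sum(
        1
        for change in plan.get("resource_changes", [])
        if change.get("type") == resource_type
        and "create" in change.get("change", {}).get("actions", [])
    )


def quota_headroom_ok(plan: dict, in_use: int, quota: int,
                      resource_type: str = "aws_eip") -> bool:
    """Return True if the planned creations fit within the quota."""
    return in_use + count_planned_creates(plan, resource_type) <= quota


# Hypothetical, heavily trimmed plan in the `resource_changes` layout.
SAMPLE_PLAN = {
    "resource_changes": [
        {"type": "aws_eip", "change": {"actions": ["create"]}},
        {"type": "aws_eip", "change": {"actions": ["create"]}},
        {"type": "aws_eip", "change": {"actions": ["no-op"]}},
        {"type": "aws_instance", "change": {"actions": ["create"]}},
    ]
}

if __name__ == "__main__":
    # 4 EIPs already in use, default quota of 5: two more will not fit.
    print(quota_headroom_ok(SAMPLE_PLAN, in_use=4, quota=5))
    # Same plan after the quota was raised to 20.
    print(quota_headroom_ok(SAMPLE_PLAN, in_use=4, quota=20))
```

A check like this doesn't replace a separate stg account; it only catches the quotas you remembered to check for.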

2 Repository policies joy oh joy

Let's talk about Elastic Container Registry (ECR). I usually store all of the Docker images in a single account and allow access from all other accounts. Some companies prefer to store the images in the dev-account (or dev+stg-account), some in the prd-account; either way, it requires sharing the images between accounts.

The images are differentiated by tags, like app:1.0.1.2383-dev and app:1.0.1.2383-prd. You can see here that I omitted stg because prd images are deployed to stg before they are deployed to prd; we don't build something new for prd. This is not mandatory, and some applications might require a specific build for prd, though it's not recommended.
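The tagging convention above can be sketched as a tiny helper. This is only an illustration of the idea that stg reuses the prd image; the mapping and the function name are hypothetical, not part of any tool.

```python
# Minimal sketch of the tag convention: stg deploys the same image as prd.
ENV_TAG_SUFFIX = {
    "dev": "dev",
    "stg": "prd",  # stg reuses the prd build, nothing new is built for it
    "prd": "prd",
}


def image_ref(app: str, version: str, env: str) -> str:
    """Build the image reference that `env` should deploy."""
    return f"{app}:{version}-{ENV_TAG_SUFFIX[env]}"


if __name__ == "__main__":
    for env in ("dev", "stg", "prd"):
        print(env, image_ref("app", "1.0.1.2383", env))
```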

dev+stg (same account)

Working on dev+stg is great because there's no overhead of dealing with repository policies; both environments are in the same account, so you set things up once for both of them. Deploying containers to stg works as expected, with no surprises. Production time! You guessed it: it fails, because the prd account cannot access the images unless you add a repository policy that allows it.

dev and stg (separate accounts)

I'm proud of you. You guessed it right again: the same issue happens when we deploy to stg, this time before production is touched, forcing us to add the relevant repository policy to allow access from stg and from prd.
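For reference, a repository policy that lets other accounts pull images could look roughly like the document generated below. This is a minimal sketch, assuming pull-only access is enough; the account IDs are placeholders and the function name is hypothetical.

```python
import json

# Actions commonly needed to pull an image from ECR.
PULL_ACTIONS = [
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "ecr:BatchCheckLayerAvailability",
]


def cross_account_pull_policy(account_ids):
    """Build a repository policy that lets the given accounts pull images."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountPull",
                "Effect": "Allow",
                # Placeholder account IDs; replace with the real stg/prd accounts.
                "Principal": {
                    "AWS": [f"arn:aws:iam::{acct}:root" for acct in account_ids]
                },
                "Action": PULL_ACTIONS,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical stg and prd account IDs.
    policy = cross_account_pull_policy(["111111111111", "222222222222"])
    print(json.dumps(policy, indent=2))
```

The printed JSON is the policy text you would attach to the repository in the account that stores the images. Roles in the stg and prd accounts still need their own IAM permissions to pull.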

We are human beings

Forgetting to request a quota increase in prd is something that can happen to anyone, so why take the chance?

"I am a human being. I was designed to make mistakes. The DevOps team's job is to design a system that mitigates the risk from those mistakes as much as possible (being honest here, no system is idiot-proof)." (by me)

Deploying to a stg account that is separated from dev provides the best conditions for catching unforeseen failures before deploying to prd, which is exactly what we aim for: no surprises in production.

But the costs?!

It's important to take into account the costs of separating environments into different accounts. For example, if you're using AWS Web Application Firewall (WAF), it's possible to attach the firewall to resources of both dev and stg. Separating dev and stg into different accounts means you'll need to create the WAF resource in both accounts, and hence pay for it "twice".

It all comes down to the question of what will cost more: potentially paying more for resources that could have been shared between dev and stg, or having unpredictable deployments to prd that might result in unwanted downtime. Once you answer this question, you'll know if you're willing to separate stg from dev.

Bonus point - Ops account

If you got this far, it means you're really into it, so if you want to take it a step further, it's best to create another account for "operations" (Ops); some might call it "management" (Mgmt).

Some companies use their dev-account as the ops-account. It's not that bad, though keeping services that impact all environments and accounts in the dev-account somewhat misses the point.

Here are the main reasons for having a separate ops-account:

  • DNS: It's easier to manage all DNS records in a single account. Doing it in the dev-account is misleading since the DNS records are relevant to all accounts and environments, so using the Ops account makes more sense.
  • CI/CD: Self-hosted CI/CD services ("runners") deploy to all environments, so they should live in a separate account.
  • VPN: If you have a VPN connection from your on-premises environment to the cloud-provider resources, then the VPN connection's cloud resources should be created in a separate account.
  • "Top/Centralized Account": Some cloud providers offer managed services for handling all accounts in a single place. By managing, I refer to billing, enforcing security policies across accounts, standardizing naming conventions across accounts, and more. Examples of these types of services: AWS Organizations, Azure Management Groups, GCP Resource Manager.
    Using these services is useful for meeting regulations such as HIPAA and GDPR, where "enforcing security across accounts" is vital.

Final Words

And if you got this far, then I guess you learned something new today 🙂
The examples provided in this blog post are merely the tip of the iceberg. There are many more cases where separating environments into different accounts reduces the number of white hairs you're growing each day; I'm sure it was scientifically proven somewhere.