In this blog-post, I will share a methodology for troubleshooting technical challenges.
As a DevOps engineer, I face technical challenges daily. In my early days, I rushed into finding solutions because “everything is time-critical”. Rushing into solving a challenge without a plan slowed down my work and raised my frustration when facing new challenges.
See how I wrote down “facing new challenges”? This fine-tuning from “issue” to “challenge” makes a big difference.
Associations when someone raises an issue
Associations when facing a new challenge
I know, I know, there are some boring and annoying issues (challenges) out there, but still, if you tag them as challenges, puzzles, or mysteries, then solving them can turn into a great journey of learning new things.
Below you’ll find the steps that I go through when facing a new challenge. Some of them are just thoughts, and some of them involve writing or drawing something. Once you do it on a challenge-ly-basis, you’ll find yourself doing most of these steps in your head.
That’s it; the next part demonstrates how to apply this method in a real-life challenge of a DevOps engineer. It’s quite technical, so keep on reading if you’re here for the goodies!
One of my customers is using Prometheus to monitor their application, which runs in a Kubernetes cluster, K3S in my case. The whole cluster runs on a single AWS EC2 instance with the Ubuntu 18.04 operating system. Running a whole Kubernetes cluster on a single machine might sound weird, but I can’t get into more details, so bear with me.
The customer requires data retention for Prometheus’s metrics, so we send the data (metrics) to a public remote storage service, NewRelic in my case, by setting the remote_write key in Prometheus’s configuration file (prometheus.yml).
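For readers who haven’t touched remote_write before, here is a minimal sketch of what such an entry might look like, written from the shell into a scratch file. The URL path, the prometheus_server query parameter, and the name my-dev-cluster are assumptions for illustration only, and the authentication settings are left out, so take the real values from NewRelic’s documentation.

```shell
# Write a minimal remote_write fragment to a scratch file.
# NOTE: the URL path and the "my-dev-cluster" name are assumptions, not values
# taken from the real setup. Authentication settings are intentionally omitted.
CONFIG_FRAGMENT="$(mktemp)"

cat > "$CONFIG_FRAGMENT" <<'EOF'
remote_write:
  - url: https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=my-dev-cluster
EOF

# Show the fragment so it can be reviewed before merging it into prometheus.yml
cat "$CONFIG_FRAGMENT"
```

The fragment only illustrates where the remote_write key lives; it is not a complete prometheus.yml.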
Initially, I thought that seeing “Done Replaying WAL” in Prometheus’s logs is enough to assume that the remote_write event was successful. Surprisingly, I was wrong.
The challenge: Block outbound connections from Prometheus to NewRelic, so I can investigate Prometheus’s logs and understand which errors (if any) are raised when a remote_write event happens without an internet connection.
How will I test it? By checking NewRelic’s dashboards to confirm that no new data (metrics) is received from Prometheus.
This challenge does not require a “permanent solution”; it’s more about finding the easiest way to prevent private resources from accessing public services without affecting my colleagues.
NOTE: All the tests were performed in the “development” environment, where network changes might affect other developers.
I mapped the components by writing a list of them, including a short description of the offered solution for each one, and then prioritized the solutions. The part where I mention the difficulty (“simple”, “overkill”) is done in my head, and I don’t really write it down.
- Security Group: remove the outbound rule `Allow outbound to 0.0.0.0/0`. (Difficulty: Simple via AWS Console)
- Server’s firewall (ufw): add a `deny all outbound to 0.0.0.0/0` rule to the EC2 instance firewall. A quick Google search got me to this Stackoverflow answer, which provided a fairly easy solution (Difficulty: Okayish by SSH to the EC2 instance)
- Kubernetes: apply a network policy that `blocks access to the 0.0.0.0/0`. Requires writing a yaml file and dealing with the Kubernetes ecosystem (Difficulty: Overkill, do it when all else fails)
- NACL: add `outbound deny 0.0.0.0/0` to the subnet’s NACL. Dropped this one because it might affect other resources in the same subnet, which in turn can disturb other developers’ work (Difficulty: Simple via AWS console)
- Routing table: remove `0.0.0.0/0` from the routing table. Dropped for the same reason I dropped NACLs (Difficulty: Simple via AWS console)
NOTE: All of these experiments could’ve been avoided if I had a deeper understanding of the term stateful. Here’s what I learned from AWS Docs about Connection tracking: a change to an inbound/outbound rule of a Security Group that initially allows a connection will not break existing connections.
I narrowed it down to two solutions: Security Group Rule and Server’s firewall (ufw). Both solutions meet the previously mentioned parameters.
Reminding you that the expected output is: stop seeing new data in the remote storage service (NewRelic).
It will be easier to describe the steps in bullets.
- Security Group: edited the outbound rules and removed `0.0.0.0/0`
- NewRelic: new data (metrics) kept arriving
I was shocked that this solution didn’t work; I was really counting on it. But no worries, I’ve got another solution up my sleeve.
sudo ufw default deny outgoing
Am I missing something? I used NetCat (nc) to check if the EC2 instance has access to NewRelic’s endpoint, and it doesn’t, so what’s going on? Here’s how I checked -
# -v = verbose
# -w = timeout after 3 seconds
# 443 = check this port, in our case it's HTTPS
$ nc -v -w 3 metric-api.newrelic.com 443
nc: connect to metric-api.newrelic.com port 443 (tcp) timed out: Operation now in progress
# This is good. It means that all access to 0.0.0.0/0 is blocked
# Example for a successful response
# Keep in mind that we DON'T want it to succeed
$ nc -v -w 3 metric-api.newrelic.com 443
Connection to metric-api.newrelic.com 443 port [tcp/https] succeeded
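To repeat this check after every change without reading nc’s verbose output each time, the same probe can be wrapped in a small function. This is only a sketch: the function name probe is made up for illustration, and it assumes an nc build that supports the -z flag.

```shell
# probe HOST PORT [TIMEOUT_SECONDS]
# Prints "open" when a TCP connection succeeds, "blocked" otherwise,
# and returns a non-zero exit code in the blocked case.
probe() {
  host="$1"
  port="$2"
  wait_s="${3:-3}"
  # -z = only check whether the port accepts connections, send no data
  # -w = give up after this many seconds
  if nc -z -w "$wait_s" "$host" "$port" >/dev/null 2>&1; then
    echo "open"
  else
    echo "blocked"
    return 1
  fi
}
```

For example, `probe metric-api.newrelic.com 443` should print blocked while outbound access is denied. Keep in mind that it only tests new connections from the machine it runs on.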
To make sure that it’s a client-side issue (mine) and not a server-side issue (NewRelic’s), I turned off the EC2 instance and checked if new data is still coming, and guess what, NewRelic stopped receiving new data (metrics).
By now, I’m positive that this can be solved on my end, and I need to investigate where this “leak of internet access” is.
NOTE: Since I blocked all access to my EC2 instance with ufw, I couldn’t SSH to it anymore, and I had to create a new one. No biggy, we got an automated process that creates an EC2 instance and deploys K3S (Kubernetes cluster).
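A possible way to avoid this kind of lockout, assuming it was caused by ufw’s default deny policy for incoming traffic, is to allow SSH before enabling the firewall. The sketch below is hedged on that assumption; the function name and the DRY_RUN switch are made up for illustration, and by default it only prints the commands instead of running them.

```shell
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

block_outbound_keep_ssh() {
  run sudo ufw allow 22/tcp            # keep inbound SSH reachable (assumes SSH listens on port 22)
  run sudo ufw default deny outgoing   # block outbound traffic by default
  run sudo ufw --force enable          # turn the firewall on without the interactive prompt
  run sudo ufw status verbose          # review the resulting policy
}

block_outbound_keep_ssh
```

Reviewing the printed commands first is cheap insurance on a machine that other people depend on.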
Mentioning the obvious first: shutting down the EC2 instance proved that Prometheus still had access to NewRelic, even though the Security Group rule was changed.
I decided to do a softer test, since shutting down an EC2 instance can’t really lead to anything. So I scaled down Prometheus to 0, changed the Security Group rule, and then scaled Prometheus back up to 1. Here’s how I did it -
$ kubectl scale --replicas=0 deployment/prometheus
deployment.apps/prometheus scaled
# Modifying the Security Group Rule
# 1. AWS Console > EC2 > Security Groups > Edit Security Group
# 2. Outbound rules > Remove `0.0.0.0/0`
$ kubectl scale --replicas=1 deployment/prometheus
deployment.apps/prometheus scaled
# 3. NewRelic > Check for new data > Tada! No new data!
It worked! Stopping Prometheus broke the active connection that Prometheus previously had with NewRelic. I wasn’t aware that remote_write keeps an active connection; I was sure it just sends the data and closes the connection. Apparently, it is documented in the official changelog of Prometheus, 1.8.0 / 2017-10-06: “...Remote storage connections use HTTP keep-alive...”
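The scale-down, rule change, scale-up sequence can also be scripted, which makes it easier to repeat. This is a sketch under assumptions: it swaps the AWS Console step for the AWS CLI, the Security Group ID is a placeholder, and it assumes the rule being removed is the default allow-all outbound rule. The function name and the DRY_RUN switch are made up for illustration, and by default nothing is executed.

```shell
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
SG_ID="${SG_ID:-sg-0123456789abcdef0}"   # placeholder Security Group ID

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

cut_prometheus_outbound() {
  # 1. Stop Prometheus so its keep-alive connection to the remote storage is closed
  run kubectl scale --replicas=0 deployment/prometheus
  # 2. Remove the allow-all outbound rule (assumes it is the default "all traffic" rule)
  run aws ec2 revoke-security-group-egress --group-id "$SG_ID" \
    --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
  # 3. Start Prometheus again; new connections are evaluated against the new rules
  run kubectl scale --replicas=1 deployment/prometheus
}

cut_prometheus_outbound
```

The order is the important part: the rule change only takes effect for Prometheus once the old connection is gone.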
Documentation: I think that this blog-post is enough 😉
Finally, I can check the logs of Prometheus and see which errors are raised when the outbound connection is not allowed.
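One way to pull the relevant lines out of the logs is a small filter. This is a sketch that assumes Prometheus logs in its logfmt style (lines containing level=warn or level=error) and that remote-storage messages contain the word remote; the function name is made up for illustration, and the exact error messages depend on the Prometheus version.

```shell
# Reads log lines on stdin and keeps only warnings/errors that mention remote storage.
remote_write_problems() {
  grep -E 'level=(warn|error)' | grep -i 'remote'
}

# Usage (assumes the deployment name used earlier in this post):
#   kubectl logs deployment/prometheus | remote_write_problems
```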
The effort of applying a solution is negligible compared to the effort of finding the root cause. This is why getting results quickly improves the ability to understand what’s going on under the hood.
The first thing that I do when I learn something new is to search for examples and demonstrations, because they ease the process of absorbing new information. Getting results quickly from a “failing” solution is similar to an example: an output that was generated by a known sequence of steps.
“Facing challenges is an adventure, so enjoy the ride and make sure you take notes; rock on!” 🤘 (by me)