Troubleshooting the right way

Published on 25 January 2021

In this blog-post, I will share a methodology for troubleshooting technical challenges.

As a DevOps engineer, I face technical challenges daily. In my early days, I rushed into finding solutions because “everything is time-critical”. Rushing into solving a challenge without a plan slowed down my work and raised my frustration when facing new challenges.

Change your mindset; it’s a challenge

See how I wrote down “facing new challenges”? This fine-tuning from “issue” to “challenge” makes a big difference.

Associations when someone raises an issue

  • Why do I need to fix this issue
  • I wish this issue wasn’t time-critical for the Developers’ team
  • I hate issues

Associations when facing a new challenge

  • This challenge tickles my brain; I’m glad I was assigned to do it
  • The more time-critical the task is, the greater the challenge, and I love being “the one who solves things quickly”
  • I love challenges

I know, I know, there are some boring and annoying issues (challenges) out there, but still, if you tag them as challenges, puzzles, or mysteries, then solving them can turn into a great journey of learning new things.


Cheatsheet

Below you’ll find the steps that I go through when facing a new challenge. Some of them are just thoughts, and some of them involve writing or drawing something. Once you do it on a challenge-ly-basis, you’ll find yourself doing most of these steps in your head.

  1. Define the challenge in simple words
    • Example: Disallow outbound connection from my server to a public remote storage service on the internet
  2. How will you test that it works - write it down and don’t adjust it to your solution. Defining, beforehand, the way you’ll test the solution and the expected result will mitigate the risk of being biased towards a specific solution
    • Example: I should stop seeing new data in the remote storage service
  3. Map the components - The least you can do is write (type) down the components that should be analyzed. A better thing to do is sketch the components and the flow of the process that needs to be analyzed. The best thing to do is draw a diagram in draw.io, but keep in mind that it’s time-consuming and only necessary for very complicated challenges. Here’s an example of writing down a list of components for disallowing internet access from a cloud-hosted server
    1. Subnet’s Routes Table
    2. Subnet’s Network Access Lists
    3. Cloud-provider firewall
    4. Server’s firewall
    5. Application - depends on the app
    • I think that five is enough for now
  4. Prioritize the solutions from best to worst according to
    1. Reliability - A permanent solution needs to be reliable (stable) and requires a good design. Attempting to apply an ad-hoc solution, “just to make it work” will make it harder to troubleshoot for your colleagues or future you
    2. Time to first results - If it fails quickly, it’s easier to move on to the next solution. Don’t start with the “longest” solution; you’ll be exhausted once you get to the other ones on the list
    3. Effort to explain the solution - If you aim for a permanent solution, then your colleagues, or future you, should also understand the logic behind it. If it takes 5 hours to explain the solution, then it should be prioritized very low or probably dropped
  5. Iterate over the solutions - Remember, it’s best to start with the ones that will provide results quickly. When hitting a solution that works, write down documentation to reproduce the steps you went through, which leads me to the next step
  6. Documentation - This is mostly done for complex challenges. For easy to moderate challenges, it’s adequate to add comments in the ticketing system that is in use (JIRA, Trello, etc.)

That’s it; the next part demonstrates how to apply this method in a real-life challenge of a DevOps engineer. It’s quite technical, so keep on reading if you’re here for the goodies!

Real-life technical challenge

One of my customers is using Prometheus to monitor their application, which runs in a Kubernetes cluster, K3S in my case. The whole cluster runs on a single AWS EC2 instance with the Ubuntu 18.04 operating system; it might sound weird to run a whole Kubernetes cluster on a single machine, but I can’t get into more details, so bear with me.

The customer requires data retention for Prometheus’s metrics, so we send the data (metrics) to a public remote storage service, NewRelic in my case, by setting the remote_write key in Prometheus’s configuration file (prometheus.yml).
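For readers who haven't used remote_write before, here is a minimal sketch of what such a section might look like. It is written as a shell function that only prints the YAML, so it can be reviewed before anything is merged into prometheus.yml. The function name, the URL path, the prometheus_server value, and the bearer_token authentication are assumptions for illustration, not the customer's real configuration; check NewRelic's documentation for the exact endpoint and authentication settings.

```shell
# Print a hypothetical remote_write section for prometheus.yml.
# Assumptions: the endpoint path and the bearer_token auth are illustrative
# and must be verified against NewRelic's documentation.
print_remote_write() {
  server_name="$1"   # hypothetical label for this Prometheus server
  license_key="$2"   # hypothetical NewRelic license key
  cat <<EOF
remote_write:
  - url: https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=${server_name}
    bearer_token: ${license_key}
EOF
}

# Usage: print_remote_write dev-k3s "$NEW_RELIC_LICENSE_KEY"
```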

Initially, I thought that seeing Done Replaying WAL in Prometheus’s logs is enough to assume that the remote_write event was successful. Surprisingly, I was wrong.

The challenge: Disallow outbound connection from Prometheus to NewRelic, to make it possible to investigate Prometheus’s logs and understand which errors (if any) are raised when there’s no internet connection upon a remote_write event.


How will I test it?: Checking NewRelic’s dashboards to see if new data (metrics) is received from Prometheus (it shouldn’t receive any)

This challenge does not require a “permanent solution”; it’s more about finding the easiest way to prevent private resources from accessing public services without affecting my colleagues.

NOTE: All the tests were performed in the “development” environment, where network changes might affect other developers.

Analyzing

I mapped the components by writing a list of them, including a short description of the offered solution, and then prioritized the solutions. The part where I mention the difficulty (“simple”, “overkill”) is done in my head, and I don’t really write it down.

  1. Cloud-provider firewall (AWS Security Group) - Remove the Allow outbound to 0.0.0.0/0 rule. (Difficulty: Simple via AWS Console)
  2. Server’s (EC2 instance) firewall (ufw) - Add a deny all outbound to 0.0.0.0/0 rule to the EC2 instance’s firewall. A quick Google search got me to this Stack Overflow answer, which provided a fairly easy solution (Difficulty: Okayish by SSH to the EC2 instance)
  3. Application - Add a Kubernetes Network Policy that blocks access to 0.0.0.0/0. Requires writing a YAML file and dealing with the Kubernetes ecosystem (Difficulty: Overkill, do it when all else fails)
  4. Subnet’s Network Access List (NACLs) - Add the rule outbound deny 0.0.0.0/0 to the subnet’s NACL. Dropped this one because it might affect other resources in the same subnet, which in turn can disturb other developers’ work (Difficulty: Simple via AWS console)
  5. Subnet’s Routes Table - Remove the route to 0.0.0.0/0 from the routing table. Dropped for the same reason I dropped NACLs (Difficulty: Simple via AWS console)
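To give a sense of what option 3 involves, here is a minimal sketch of a deny-all-egress Network Policy, printed by a shell function so it can be reviewed or piped to kubectl apply -f -. The function name, the policy name, and the app label are assumptions; the real label depends on how Prometheus was deployed, and the policy only takes effect if the cluster's network plugin enforces Network Policies.

```shell
# Print a NetworkPolicy that selects pods by their "app" label and allows
# no egress. Listing "Egress" under policyTypes without any egress rules
# denies all outbound traffic from the selected pods.
print_deny_egress_policy() {
  app_label="$1"   # hypothetical value of the pods' "app" label
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-egress-${app_label}
spec:
  podSelector:
    matchLabels:
      app: ${app_label}
  policyTypes:
    - Egress
EOF
}

# Usage: print_deny_egress_policy prometheus | kubectl apply -f -
```

Even as a sketch, it shows why this option ranked last: it needs a manifest, the right labels, and a cluster that enforces the policy.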

NOTE: All of these experiments could’ve been avoided if I had a deeper understanding of the term stateful. Here’s what I learned from AWS Docs about Connection tracking - A change to an inbound/outbound rule of a Security Group that initially allows a connection will not break existing connections.

I narrowed it down to two solutions - Security Group Rule and Server’s firewall (ufw). Both solutions meet the previously mentioned parameters:

  1. Time to first results
  2. Reliability
  3. Effort to explain

Iterating over the solutions

Reminding you that the expected output is: stop seeing new data in the remote storage service (NewRelic).

Security Group Rule

It will be easier to describe the steps in bullets.

  1. AWS Console > EC2 > Security Groups > Edit Security Group
  2. Outbound rules > Remove 0.0.0.0/0
  3. NewRelic > Check for new data > No good, data is still coming
  4. Added back the rule to allow outbound to 0.0.0.0/0
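The console steps above could also be done from the AWS CLI. The sketch below wraps aws ec2 revoke-security-group-egress in a function with a dry-run switch; the function name, the DRY_RUN variable, and the Security Group ID are made up for illustration, and it assumes the existing rule is "all traffic" to 0.0.0.0/0.

```shell
# Revoke the "allow all outbound" rule from a Security Group.
# Set DRY_RUN=1 to print the command instead of running it.
revoke_all_egress() {
  sg_id="$1"   # hypothetical Security Group ID
  # IpProtocol -1 means "all protocols"
  perms='[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'
  ${DRY_RUN:+echo} aws ec2 revoke-security-group-egress \
    --group-id "$sg_id" --ip-permissions "$perms"
}

# Usage: DRY_RUN=1 revoke_all_egress sg-0123456789abcdef0
```

Adding the rule back should work the same way with aws ec2 authorize-security-group-egress and the same arguments.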

I was shocked that this solution didn’t work; I was really counting on it. But no worries, I had another solution up my sleeve.

Server’s firewall (ufw)

  1. SSH to EC2 instance
  2. Execute sudo ufw default deny outgoing
  3. NewRelic > Check for new data > No good, data is still coming

Am I missing something? I used NetCat (nc) to check if the EC2 instance has access to NewRelic’s endpoint, and it doesn’t, so what’s going on? Here’s how I checked -

# -v = verbose
# -w = timeout after 3 seconds
# 443 = check this port, in our case it's HTTPS
$ nc -v -w 3 metric-api.newrelic.com 443
nc: connect to metric-api.newrelic.com port 443 (tcp) timed out: Operation now in progress
# This is good. It means that all access to 0.0.0.0/0 is blocked

# Example for a successful response
# Keep in mind that we DON'T want it to succeed
$ nc -v -w 3 metric-api.newrelic.com 443
Connection to metric-api.newrelic.com 443 port [tcp/https] succeeded
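
The same check can be wrapped in a small function, which is handy when repeating it after every change. This is a minimal sketch; the function name and its one-word output are made up for illustration, and it relies on nc supporting the -z (scan without sending data) flag.

```shell
# Print "reachable" if a new TCP connection to HOST:PORT succeeds within
# 3 seconds, otherwise "blocked". Note that "blocked" also covers DNS
# failures and a missing nc binary.
check_outbound() {
  host="$1"
  port="$2"
  if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "blocked"
  fi
}

# Usage: check_outbound metric-api.newrelic.com 443
```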

Is it me? Or NewRelic?

To make sure that it’s a client-side issue (mine) and not a server-side issue (NewRelic’s), I turned off the EC2 instance and checked if new data is still coming, and guess what, NewRelic stopped receiving new data (metrics).

By now, I’m positively sure that this can be solved on my end, and I need to investigate where this “leak of internet access” is.

NOTE: Since I blocked all access to my EC2 instance with ufw, I couldn’t SSH to it anymore, and I had to create a new one. No biggie; we have an automated process that creates an EC2 instance and deploys K3S (Kubernetes cluster).
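
One way to reduce the risk of such a lockout is to allow inbound SSH explicitly before changing the outgoing default. The sketch below is untested on the exact setup described here, and port 22 is an assumption. The function name is made up, and it only prints the ufw commands so they can be reviewed first.

```shell
# Print ufw commands that keep inbound SSH open while denying outbound
# traffic. Nothing is executed; review the output before running it.
print_ufw_plan() {
  ssh_port="${1:-22}"   # assumption: SSH listens on port 22 unless told otherwise
  echo "ufw allow ${ssh_port}/tcp"
  echo "ufw default deny outgoing"
}

# Usage: print_ufw_plan | sudo sh
# Revert afterwards with: sudo ufw default allow outgoing
```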

The Epiphany

Mentioning the obvious first - turning off the EC2 instance proved that Prometheus still had access to NewRelic, even though the Security Group rule was changed.

I decided to do a softer test since shutting down an EC2 instance can’t really lead to anything. So I scaled Prometheus down to 0, changed the Security Group rule, and then scaled Prometheus back up to 1. Here’s how I did it -

$ kubectl scale --replicas=0 deployment/prometheus
deployment.apps/prometheus scaled

# Modifying the Security Group Rule
# 1. AWS Console > EC2 > Security Groups > Edit Security Group
# 2. Outbound rules > Remove `0.0.0.0/0`

$ kubectl scale --replicas=1 deployment/prometheus
deployment.apps/prometheus scaled

# 3. NewRelic > Check for new data > Tada! No new data!

It worked! Stopping Prometheus broke the active connection that Prometheus previously had with NewRelic. I wasn’t aware that remote_write keeps an active connection; I was sure it just sends the data and closes the connection. Apparently, it is documented in the official changelog of Prometheus - 1.8.0 / 2017-10-06 - “...Remote storage connections use HTTP keep-alive...”
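
A long-lived connection like this can be spotted by listing established TCP connections. The sketch below filters ss -tn output for established connections to a given remote port. The function name is made up, and it assumes the usual ss column layout (state in the first column, peer address in the fifth). When run on the host, ss may not show connections that live inside a pod's network namespace, so treat it as a starting point.

```shell
# Read `ss -tn` style output on stdin and print the peer addresses of
# established connections whose remote port matches the given port.
established_to_port() {
  awk -v port="$1" '$1 == "ESTAB" && $5 ~ (":" port "$") { print $5 }'
}

# Usage: ss -tn | established_to_port 443
```

If a peer shows up here while a fresh nc attempt to the same host times out, an old connection is most likely still alive.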

Documentation: I think that this blog-post is enough 😉

Finally, I can check the logs of Prometheus and see which errors are raised when the outbound connection is not allowed.

Final words

The effort of applying a solution is negligible compared to the effort of finding the root cause; this is why getting results quickly improves the ability to understand what’s going on under the hood.

The first thing that I do when I learn something new is to search for examples and demonstrations because they ease the process of absorbing new information. Getting results quickly from a “failing” solution is similar to an example - an output that was generated by a known sequence of steps.

“Facing challenges is an adventure, so enjoy the ride and make sure you take notes; rock on!” 🤘 (by me)
