As a longtime systems engineer, Blerim Sheqa knows all about using tools like Grafana to debug infrastructure issues. Now the CPO of Icinga, an open source monitoring tool, he gave a talk at GrafanaCon LA about how not to fail at visualization.
There are common pitfalls that can lead you to spend hours “hunting ghosts,” said Sheqa. “Wrong data visualization leads us to misinterpretation. We think we found something in a graph that looks strange, but actually just the graph was strange.” Here are the things to look out for:
“I searched the web for the worst graph I could find,” Sheqa said. He came up with this one from a 2005 newspaper about a gun law in Florida.
The graph seems to show that the rate of murders by guns dropped after the law was passed. “But if you look closely, you can see on the left side that actually they just flipped the graph,” he pointed out. “It’s technically correct, but it doesn’t follow conventions.”
Lesson #1: “Following conventions is one important part when visualizing data, especially when using this data to debug any issue in your infrastructure.”
Another example is this load graph:
“We can see that the load drops and sometimes increases and looks normal at first glance,” Sheqa said. “What we cannot see here is that on the left side, the Y axis starts at about 60. So the load is actually pretty high, but because we didn’t set the correct minimum, we cannot see it at first glance. We believe that the graph shows us a pretty normal load on the server, where it actually is pretty high, depending on the hardware obviously.”
The graph should look like this:
Lesson #2: “Use proper labeling, setting minimum and maximum where it is necessary and setting the correct values and describing the data that we see.” You may know the data, but you have to make sure that your coworkers can understand what the graph shows, too.
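To make the effect of the axis minimum concrete, here is a minimal Python sketch. The load samples and axis limits are made up for illustration and are not taken from the talk's graph. It measures how much of the panel height the same data swing occupies on a tightly auto-scaled axis compared with an axis that starts at zero.

```python
def visible_swing(values, axis_min, axis_max):
    """Fraction of the panel height covered by the data's min-to-max swing."""
    span = axis_max - axis_min
    if span <= 0:
        raise ValueError("axis_max must be greater than axis_min")
    return (max(values) - min(values)) / span


# Hypothetical load samples hovering just above 60.
load = [61.0, 63.5, 62.0, 64.0, 61.5, 63.0]

# Axis fitted tightly around the data, roughly what auto-scaling produces.
auto = visible_swing(load, axis_min=60.0, axis_max=65.0)

# Same data on an axis pinned to zero.
pinned = visible_swing(load, axis_min=0.0, axis_max=65.0)

print(f"auto-scaled axis: swing fills {auto:.0%} of the panel")
print(f"zero-based axis:  swing fills {pinned:.0%} of the panel")
```

On the tight axis the line wanders across most of the panel and reads as ordinary ups and downs. On the zero-based axis it sits as a nearly flat band near the top, which is the more honest picture of a consistently high load.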
This graph of memory usage on a server is stacked, which causes problems. You can’t see how much memory is free on the server.
“That is because the stacking like it is done here is completely useless,” said Sheqa. “You have to figure out what are the labels on the bottom, what is actually the free memory? Is it increasing or is it decreasing?”
The data is clear in this visualization:
“It’s still stacked, but I just flipped the values, and you can see at first glance that the memory is actually increasing,” he explained. “I put the total free memory to the background so you can still see it, but it’s not the first thing that comes into your eye.”
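One way to see why the order of a stack matters is to compute the edges each layer is drawn between. The sketch below uses made-up memory samples in GiB and a hypothetical `stack` helper, and it is only one reading of the fix described above. It shows that a series at the bottom of the stack is drawn from a flat baseline, while the same series at the top is squeezed between a moving lower edge and the constant total.

```python
def stack(layers):
    """Return (name, lower_edge, upper_edge) per layer, bottom to top."""
    baseline = [0.0] * len(layers[0][1])
    edges = []
    for name, values in layers:
        top = [b + v for b, v in zip(baseline, values)]
        edges.append((name, baseline, top))
        baseline = top
    return edges


# Hypothetical samples in GiB; the total stays at 8 GiB throughout.
used = [5.0, 4.5, 4.0, 3.5]
cached = [2.0, 2.0, 2.0, 2.0]
free = [1.0, 1.5, 2.0, 2.5]

# Free memory as the top layer: its upper edge is the flat total.
free_on_top = stack([("used", used), ("cached", cached), ("free", free)])

# Free memory as the bottom layer: its upper edge is the raw series.
free_on_bottom = stack([("free", free), ("cached", cached), ("used", used)])

for label, result in (("free on top", free_on_top), ("free on bottom", free_on_bottom)):
    name, lower, upper = next(layer for layer in result if layer[0] == "free")
    print(f"{label}: lower edge {lower}, upper edge {upper}")
```

With free memory on top, the reader has to subtract one moving line from a flat one to recover the trend. With free memory at the bottom, the upper edge of the layer is the trend.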
Lesson #3: Highlight the important parts of a graph so they are the first thing a reader sees.
The following graph is about requests.
“It goes up and down, and it looks fancy and shiny, good on our dashboard. We can put it on a TV screen,” he said. “But actually we don’t know anything about the data that is shown here. Is it many requests? Are they going up or down? What is happening here?”
Here’s a better version of it:
Lesson #4: “Using the grid is the most important thing I figured out in the past,” said Sheqa. “It changes so much if you actually use the most common things that are out there, like grids and the Y axis, because now we can see how much the difference between the most bottom and the most up value actually is.”
The following CPU graph “tells us there is a peak somewhere,” said Sheqa, “and for comparing things, this is actually not the best thing that you can do because in this graph there are many CPUs merged into one graph and averaged. So just by looking at averages, we cannot tell any details about the behavior.”
“Split the graph like you should do, because you have more than one CPU,” he said. “In this case it’s four, and you can see that each CPU behaves completely differently.”
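A small Python sketch shows how much an average can hide. The per-core utilization values are invented for illustration; only the count of four CPUs comes from the talk.

```python
from statistics import mean

# Hypothetical utilization samples in percent, one list per core.
per_cpu = {
    "cpu0": [95, 97, 99, 98],  # close to saturation
    "cpu1": [10, 12, 9, 11],
    "cpu2": [40, 38, 42, 41],
    "cpu3": [5, 6, 4, 5],
}

# What a merged panel plots: the mean across cores at each point in time.
averaged = [mean(sample) for sample in zip(*per_cpu.values())]
print("averaged:", averaged)

# What split panels show: each core on its own.
for name, values in per_cpu.items():
    print(f"{name}: peak {max(values)}%")
```

The averaged series never reaches 40 percent, so a merged panel looks calm even though one core is almost fully busy the whole time.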
Lesson #5: “Grafana has this feature where we can just repeat panels or repeat entire rows. You should make use of that when you have something like CPUs or network statistics.”
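As a rough sketch of what a repeated panel looks like in a dashboard definition, the Python snippet below builds a fragment of dashboard JSON. It assumes a template variable named `cpu` with four hard-coded values and a panel whose `repeat` field points at that variable. The variable name, its values, and the panel title are illustrative, the panel's query is omitted because it depends on the data source, and exact field names can differ between Grafana versions.

```python
import json

# Illustrative fragment only; a real dashboard needs more fields
# (data source, targets, grid position, and so on).
dashboard_fragment = {
    "templating": {
        "list": [
            {
                "name": "cpu",        # hypothetical variable name
                "type": "custom",
                "query": "0,1,2,3",   # one value per CPU
                "multi": True,
                "includeAll": True,
            }
        ]
    },
    "panels": [
        {
            "title": "CPU $cpu",      # variable is interpolated per copy
            "type": "timeseries",
            "repeat": "cpu",          # repeat this panel for each selected value
            "repeatDirection": "h",   # lay the copies out horizontally
        }
    ],
}

print(json.dumps(dashboard_fragment, indent=2))
```

With all four values selected, the dashboard would render one copy of the panel per CPU instead of a single merged graph.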
Graphs should always be readable for everyone in your organization, not just the person who created them.
“You can add more features from Grafana like adding annotations, which add even more context to your graph so you can better understand what the graph is actually showing to you,” said Sheqa.
The following load graph has annotations about when monitoring sent an alert because of a failing service.
“You can not only see what is happening to the load, but you can also see, when did the monitoring alert us. At which point and at which time frame do I have to look at the graphs?” he said.
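Annotations like these can also be created programmatically. The sketch below builds the JSON body that a monitoring notification script might send to Grafana's annotations HTTP API (a POST to `/api/annotations`). The helper name, the service name, the tags, and the timestamp are assumptions for illustration. Sending the request and authenticating are left out.

```python
import json
from datetime import datetime, timezone


def alert_annotation(service, state, sent_at, tags=("alert",)):
    """Build an annotation body for the moment an alert was sent."""
    return {
        # The annotations API expects epoch milliseconds.
        "time": int(sent_at.timestamp() * 1000),
        "tags": list(tags) + [service],
        "text": f"{service} is {state}",
    }


# Hypothetical alert: the "http" service went CRITICAL at noon UTC.
sent_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
payload = alert_annotation("http", "CRITICAL", sent_at)
body = json.dumps(payload)
print(body)
```

Tagging each annotation with the service name makes it possible to show only the relevant alerts on a given dashboard.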
On the other hand, there is such a thing as too many annotations and too much context. For example, this single dashboard had to be split into three columns for the slide:
“It’s pretty nice looking; you can see many colors and many shapes and many things, but for someone that is not exactly into the data, it’s pretty useless,” said Sheqa.
Lesson #6: “You always should care about how much information you add to one single dashboard or to one single graph.”
Know Your Data
Last but not least, Sheqa pointed to the most common pitfall in visualization: creating graphs when you don’t understand the data.
“In my opinion, it’s absolutely necessary that you know what data you are collecting and understand each and every metric that you are collecting, so you can actually use it for debugging,” he said. “The best dashboards cannot help you to find an issue if you do not understand the underlying data that you collected before. When we think we know the data, but we don’t actually know them, this leads us usually to building wrong graphs and wrong dashboards. And this again leads us to misinterpretation.”