Engineering Decision Making and Risk Management: visualization

Wednesday, May 6, 2020

Scary Spider Chart

Spider chart. Creator: redacted.

The spider chart shown in this post came from a dissertation that studied multi-agent systems. A spider chart is also known as a radar chart, and writers use it to graph multivariate data. In a typical example, each alternative has a polygon that connects the points that represent its performance on multiple measures, with one point on one spoke for each measure. It can be effective to show how one alternative performs well on one or more measures and another alternative performs well on another set of measures. In that case, the polygons for these alternatives will have different shapes.
Conversely, the polygons for two alternatives will be very similar if the performance of those alternatives is similar across the different measures.

Unfortunately, the spider chart shown here has reversed the typical use. There are two alternatives (adaptive risk and fixed risk) that were tested under three scenarios (low, medium, and high workload). In this chart, there are six spokes, one for each combination; a typical chart would have six polygons. Instead, there are three polygons, one for each measure: profit, completed percentage, and failed percentage. The last two measures always add to 100%, so one of them is redundant.

Thus, there are only twelve useful data points (two measures for six combinations). The data-to-ink ratio is very low. Given the small amount of data, a simple table may be the best way to convey this information.

The purpose of this spider chart is to show how the two alternatives compare on the performance measures, but this chart makes that comparison very difficult. A reader normally relies upon the slope of a curve (either positive or negative) to determine how performance is changing, but that will not work here because the different scenarios have different orientations and one performance measure is the complement of the other.

Because there are effectively two performance measures, a two-dimensional scatter plot (with appropriate labels) would have been appropriate. The second chart (which I created using the same data) is a possibility; this makes the change from fixed risk to adaptive risk more clear, but it still has a low data-to-ink ratio.

Two-dimensional scatter plot. Creator: Jeffrey W. Herrmann.

Friday, December 29, 2017

Visualizing a distribution

Understanding how the students in my course did on an exam includes visualizing the distribution of their scores. In the past I used two standard visualizations: a histogram of the letter grade and a scatter plot that showed the distribution (similar to a cumulative distribution function).

For example, consider a final exam that 49 students took. The exam was worth 35 points, so the thresholds for letter grades (90% = "A," etc.) were 21, 24.5, 28, 31.5. After converting each student's score to a letter grade, I generated a histogram of the letter grades (Figure 1), which clearly shows that the most common grade was a "B." To get more details, I also generated the full distribution of the exam scores (Figure 2).

Figure 1. The histogram.

Figure 2. The full distribution (each dot is one score).

Although the vertical gridlines in the full distribution are on the thresholds for letter grades, there is a great deal of wasted space, and it is not obvious on this chart by itself which letter grade was most common. (In a larger data set, the vertical distance between points would have to shrink and could become too small to distinguish the markers.) To overcome this limitation, I changed the cumulative count to restart at each threshold, which yielded the chart (which I am calling a "distrogram") in Figure 3.

Figure 3. The distrogram.

There are now four "curves," one for each letter grade, and each curve shows the distribution of scores in that letter grade. The height of each curve shows the total number of scores in that letter grade (as the histogram does). Thus the distrogram clearly shows that the most common grade was a "B." It also shows that the scores that correspond to "C" (between 24.5 and 28) were near the threshold for a "B" and that no one earned a perfect score (a 35). Thus, this distrogram provides the same information as the histogram plus additional information in a layout that is easier to navigate than the full distribution (the second chart). Because there are multiple curves, there is more vertical space between markers. Compared with the full distribution, however, the distrogram does require more operations to answer a distribution question such as "How many students earned at least 30 points on the exam?" because one would have to add the counts for multiple bins.

A distrogram should be useful for numerical data where the individual values and their grouping into categories (such as letter grades) based on these values are important. The key feature is that the simple bar in the histogram is replaced by a scatter plot showing the distribution of the values in that bin. Markers are needed to show the individual values; lines connecting the markers are not necessary; if they are used, they should be light so that the markers are easy to see. Horizontal and vertical gridlines should also be light if used.

To create a distrogram, define the thresholds for the bins and sort the values in ascending order. Determine the bin for each value. Add a cumulative count that starts at 1 and increases by 1 at each value (even if this value and the previous one are equal). Reset the cumulative count to 1 when the new value and the previous value are in different bins. Create a scatter plot, with the values on the horizontal axis and the cumulative count on the vertical axis