Stats Tips - Crossed comparison of group prevalences

Maxime made a bar plot displaying estimated prevalences of whatever in 12 different groups resulting from the cross-classification of 3 variables with 2, 2 and 3 categories respectively.

In addition, he wished to highlight the \(p\)-values for the significant comparisons across groups.

(a) Vector competence of *Culex pipiens pipiens* infected with USUV EU2 and EU3 lineages at high and low bloodmeal titer. Mosquitoes were examined for the presence of viral genome detected by RT-qPCR. The infection efficiency (IE) corresponds to the proportion of mosquitoes whose abdomens contain infectious viral particles among infected mosquitoes, the dissemination efficiency (DE) corresponds to the proportion of mosquitoes whose thorax contains infectious viral particles among infected mosquitoes and finally. Prevalence calculated by the generalized linear mixed model. Error bars represent the standard error. Only the significant differences estimated by generalized linear mixed models are represented on the plot.

I suggest replacing the bars with simpler and cleaner point-and-ranges. Indeed, only the tip of the bar is really informative. We can safely get rid of the rest and direct the attention to the relevant bit.

I would also suggest switching the axes. It is easier to compare positions horizontally, and to read labels on the left-hand side.

The representation of the groups is challenging, because there are many options for mapping the grouping variables to graphical elements.

In the original figure, the Tissue (Midguts, Thorax and Heads) is mapped to the \(x\)-axis; the Lineage (EU2, EU3) to the filling colour and the Bloodmeal titer (Low, High) to plot panels.

The choice should be made taking into account which comparisons one wishes to facilitate, or to induce. For instance, comparing neighbouring bars is easiest and most natural. Comparing bars across categories in the axis is not hard, but carries some more cognitive load to hop around the intermediate bars and to make sure you are looking at the comparable bars from each group. Finally, comparing corresponding bars from different panels is the hardest, from a visualisation perspective.

After a brief discussion with Maxime, it was clear that the prevalences across Tissues were the least comparable of all, as they correspond to 3 different parameters of vector competence (cf. figure caption).

Indeed, in the plot he highlighted 3 contrasts across lineages and one contrast across bloodmeal titer, in all cases within the same tissue. In my opinion, a representation of lineage and bloodmeal titer within tissue works better.

One option is to represent the inner 4 groups at the same level, and to code the distinction between variables with colour and some other graphical parameter, such as line width, line type, or even a variation in the luminance or saturation of the colour.

I kept the same colour palette for lineage, which is colourblind-safe and appropriate for a categorical variable that requires sharp contrast.

Bloodmeal titer, in contrast, is ordinal (high is more than low). So I initially thought that a variation of the 2 base colours with increased luminance would work nicely.
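To make the idea of a luminance variation concrete, here is a minimal sketch in Python using only the standard library. The base colours are hypothetical placeholders (the actual palette of the figure is not given here), and the choice of giving the lighter variant to the low titer is also an assumption.

```python
import colorsys


def lighten(hex_colour, amount):
    """Move the HLS lightness of `hex_colour` a fraction `amount` towards white."""
    digits = hex_colour.lstrip("#")
    r, g, b = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    l = l + (1 - l) * amount  # amount = 0 keeps the colour, amount = 1 gives white
    rgb = colorsys.hls_to_rgb(h, l, s)
    # Clamp to the valid byte range to guard against floating-point drift.
    channels = [min(255, max(0, round(v * 255))) for v in rgb]
    return "#{:02x}{:02x}{:02x}".format(*channels)


# Hypothetical base colours for the two lineages (not the original palette).
BASE = {"EU2": "#0072B2", "EU3": "#D55E00"}

# Assumption: high titer keeps the base colour, low titer gets a lighter variant.
PALETTE = {}
for lineage, colour in BASE.items():
    PALETTE[(lineage, "High")] = colour
    PALETTE[(lineage, "Low")] = lighten(colour, 0.5)
```

The hue still identifies the lineage, while the lightness orders the two titers within each lineage.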

This gives the following basic reformulation of the initial figure, which could work well in a journal article.

Now, I’m not a fan of the \(p\)-value annotations in the plot (cf. #Statstips #18). I would rather remove them. But I’m not here to discuss that, so I’ll try to provide tips to display them anyway.

In the current plot, it is not easy to spot the groups that are being compared with each annotation (it was easier with the wider bars).

Instead, I would directly point to the corresponding groups with very light lines.

While the colours work well, I was not satisfied with the legend either. It describes two variables simultaneously and you need to look at it for a while to figure out the logic.

I thought that a bivariate legend should make this logic more explicit.

This resulted in the following second version.

That is better, but I still had doubts about the legend. It conveys the colour code very well, but the correspondence between the arrangements of the symbols in the figure and in the legend is broken, which still demands some cognitive effort to figure out how to read the plot.

So I abandoned the idea of coding both variables with variations of the base colours and switched to line width instead. I think the result is much more straightforward to read.
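As an illustration of this last design, here is a rough sketch in Python with matplotlib; it is not the code behind the figures. It draws horizontal point-and-ranges nested within tissue, with colour for lineage and line width for bloodmeal titer. The colours, the widths, and the function names (`layout_rows`, `tissue_centres`, `plot_point_ranges`) are all assumptions made for the example, and the estimates must be supplied by the reader.

```python
from itertools import product

TISSUES = ["Midguts", "Thorax", "Heads"]
LINEAGES = ["EU2", "EU3"]
TITERS = ["Low", "High"]

# Hypothetical styling: colour encodes lineage, line width encodes titer.
LINEAGE_COLOUR = {"EU2": "#0072B2", "EU3": "#D55E00"}
TITER_WIDTH = {"Low": 1.0, "High": 3.0}


def layout_rows(tissues=TISSUES, lineages=LINEAGES, titers=TITERS, gap=1.0):
    """Return one row per group, stacked within tissue, with its y position and style."""
    rows, y = [], 0.0
    for tissue in tissues:
        for lineage, titer in product(lineages, titers):
            rows.append({
                "tissue": tissue, "lineage": lineage, "titer": titer, "y": y,
                "colour": LINEAGE_COLOUR[lineage],
                "linewidth": TITER_WIDTH[titer],
            })
            y += 1.0
        y += gap  # extra space separating consecutive tissues
    return rows


def tissue_centres(rows):
    """Map each tissue to the mean y position of its rows (used for axis labels)."""
    ys = {}
    for row in rows:
        ys.setdefault(row["tissue"], []).append(row["y"])
    return {tissue: sum(v) / len(v) for tissue, v in ys.items()}


def plot_point_ranges(estimates, rows=None, ax=None):
    """Draw point ± standard error horizontally.

    `estimates` maps (tissue, lineage, titer) to (prevalence, standard_error).
    """
    import matplotlib.pyplot as plt  # lazy import: the layout helpers do not need it

    rows = layout_rows() if rows is None else rows
    if ax is None:
        _, ax = plt.subplots()
    for row in rows:
        est, se = estimates[(row["tissue"], row["lineage"], row["titer"])]
        ax.plot([est - se, est + se], [row["y"], row["y"]],
                color=row["colour"], linewidth=row["linewidth"])
        ax.plot(est, row["y"], "o", color=row["colour"])
    centres = tissue_centres(rows)
    ax.set_yticks(list(centres.values()))
    ax.set_yticklabels(list(centres.keys()))
    ax.invert_yaxis()  # first tissue at the top
    ax.set_xlabel("Estimated prevalence")
    return ax
```

Calling `plot_point_ranges(estimates)` with a dictionary of fitted prevalences and standard errors returns the matplotlib axes, to which a legend and any annotations can then be added.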

Finally, I think the caption should be more descriptive of the elements in the figure, and be understandable on its own. The original caption is too long and provides information that belongs in the text (e.g. detection by RT-qPCR, linear mixed model, vector competence…).

I’m overstepping a bit into the paper, but I’d suggest a simpler figure caption, leaving the interpretation in terms of vector competence, the inference method and other details for the main text.

(a) Estimated prevalence (point ± standard error) of USUV by lineage and bloodmeal titer in the abdomen, thorax and heads of infected mosquitoes
