Stats Tips - A classical comparison of groups

Arnaud sent in a classical problem: comparing numerical quantities across groups.

Specifically, he considers two numerical quantities that represent different measures of the same phenomenon (codon usage, whatever it is), while the groups are formed by two lineages within six different genes.

He first proposed using boxplots to visualise the variability within groups, with one sub-figure for each target outcome.

Comparison of codon usage for each gene of PPRV using coding sequences belonging to PPRV lineage II and lineage IV, as calculated with a) the mean codon adaptation index (CAI) using the genome of *Ovis aries* as reference; b) the effective number of codons (ENC).

This is a very standard and robust visualisation type, well understood by everyone.

The most important suggestion I would make here is to flip the axes to represent the distributions horizontally. This facilitates the comparison of the outcomes for the same gene, which are now aligned.

Beyond that, I removed some distracting elements (background and grid lines), dropped the redundant legend, spelled out the acronyms on the axes for clarity and changed the colour palette.

The default ggplot colour palette for a categorical variable such as Lineage is not bad. It is colour-blind safe, and both colours have similar intensity while being very distinctive. I just wanted to propose an alternative option with the same good properties.

This gives the following basic reformulation of the initial figure, which could work well in a journal article.

This still has a couple of issues:

- The redundant Gene axis
- The concern that the boxplots might be hiding some unexpected patterns. E.g.:
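To make the second concern concrete, here is a minimal sketch in plain Python (not the ggplot code behind the figures). The two samples are made-up numbers, not the codon usage data, and `five_number_summary` is a hypothetical helper name. Both samples yield exactly the same five numbers a boxplot draws, yet one is evenly spread and the other is clustered around two values.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), the values a boxplot draws."""
    ordered = sorted(data)
    # method="inclusive" treats the sample minimum and maximum as the
    # 0% and 100% quantiles and interpolates linearly in between.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, median, q3, ordered[-1])


# Made-up illustrative samples (assumption), not the codon usage data.
evenly_spread = [0, 1, 2, 3, 4, 5, 6, 7, 8]
two_clusters = [0, 2, 2, 2, 4, 6, 6, 6, 8]

for name, sample in [("evenly spread", evenly_spread),
                     ("two clusters", two_clusters)]:
    print(name, five_number_summary(sample))
    # Crude text histogram: one '#' per observation at each value.
    counts = Counter(sample)
    for value in range(0, 9):
        print(f"  {value}: {'#' * counts[value]}")
```

The printed summaries coincide while the text histograms do not, which is the kind of pattern a boxplot alone cannot reveal.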

So a fancier solution would be to use a violin plot instead, or some other form of density estimate of the underlying data distribution.
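For readers unfamiliar with what such a density estimate computes, the following is a minimal sketch of a Gaussian kernel density estimate in plain Python. It is an illustration only: the sample values and the bandwidth of 0.5 are assumptions chosen for the example, `gaussian_kde` is a hypothetical helper name, and the actual figures rely on the plotting library's own estimator and bandwidth rule.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function x -> estimated density at x.

    Each observation contributes a Gaussian bump of standard deviation
    `bandwidth`; the estimate is the average of those bumps.
    """
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Made-up sample with two clusters (assumption), not the codon usage data.
sample = [0, 2, 2, 2, 4, 6, 6, 6, 8]
density = gaussian_kde(sample, bandwidth=0.5)

# Evaluate on a coarse grid and draw a rough text version of the curve.
for step in range(-2, 21):
    x = step * 0.5
    bar = "#" * round(density(x) * 100)
    print(f"{x:5.1f} {bar}")
```

A violin plot is essentially this curve mirrored around an axis. A smaller bandwidth follows the data more closely, while a larger one smooths the two clusters together.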

With only a few data points, it would have been preferable to display each point individually.

Furthermore, taking advantage of the fact that the main interest is the comparison of only two lineages, I presented the densities face-to-face within each gene to emphasise this contrast further.

Finally, I displayed the two outcomes as two facets of the same plot to get rid of the redundant axis, changed the colour palette once more and adapted the figure caption accordingly.

Kernel density estimates of codon usage for each gene of PPRV using coding sequences belonging to PPRV lineage II and lineage IV, as calculated with a) the mean codon adaptation index (CAI) using the genome of *Ovis aries* as reference; b) the effective number of codons (ENC). The point-intervals at the base indicate the median and the 66% and 95% quantile intervals.
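As a rough illustration of what the point-intervals summarise, here is a minimal sketch in plain Python. It assumes the intervals in the caption are central intervals covering 66% and 95% of the observations, and it uses linear interpolation between order statistics. `quantile` and `point_interval` are hypothetical helper names, and the input is made-up data.

```python
import math


def quantile(sorted_data, p):
    """Quantile by linear interpolation between order statistics."""
    pos = (len(sorted_data) - 1) * p
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return sorted_data[lower] * (1 - frac) + sorted_data[upper] * frac


def point_interval(data, widths=(0.66, 0.95)):
    """Return the median plus one central interval per requested width."""
    ordered = sorted(data)
    summary = {"median": quantile(ordered, 0.5)}
    for width in widths:
        tail = (1 - width) / 2  # probability mass left out on each side
        summary[width] = (quantile(ordered, tail),
                          quantile(ordered, 1 - tail))
    return summary


# Made-up data (assumption): the integers 0..100.
summary = point_interval(range(101))
for key, value in summary.items():
    print(key, value)
```

The thick and thin segments of a point-interval correspond to the narrower and wider of these ranges, with the dot at the median.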
