Stats Tips

Author

Facundo Muñoz

This is a collection of short and basic statistical (in a broad sense that includes data management, analysis and visualisation) tips, written in a positive and constructive way and inspired by many years of collaborating with researchers with various backgrounds.

They should be applicable to almost anyone analysing data.

I emailed them weekly to UMR ASTRE colleagues as a way of stimulating reflection and exchanging ideas.

Contributions are welcome.

#1 Collect data with as much detail as possible and summarise later

When you record data from field observation or lab experiments, do it at the finest scale available.

E.g. If you collect the age of individuals or the time to an event, record the number of days, months or years. You can always form groups for analysis later, but you cannot recover the detail from grouped data.

#2 Use consistent units within variables (columns)

For each variable (column), choose an appropriate unit and use it systematically in all records. Document the unit in the variable name or, even better, in a data-dictionary, but not in the cells, which should contain only the values.

E.g. Instead of a variable named “age” with values “3 days”, “1 wk”, “11 D”, use a variable “age_days” with numeric values “3”, “7”, “11”.
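If you inherit a column with mixed units, the clean-up can be scripted. The following is a minimal sketch, assuming that only day and week units occur; the `UNIT_DAYS` table and the `age_to_days` function are hypothetical names introduced for illustration.

```python
import re

# Hypothetical conversion factors to days; extend as needed for your data.
UNIT_DAYS = {"d": 1, "day": 1, "days": 1, "wk": 7, "week": 7, "weeks": 7}

def age_to_days(raw):
    """Convert a free-text age such as '1 wk' into a number of days."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", raw)
    if match is None:
        raise ValueError(f"Cannot parse age: {raw!r}")
    value, unit = float(match.group(1)), match.group(2).lower()
    if unit not in UNIT_DAYS:
        # Fail loudly rather than guessing an undocumented unit.
        raise ValueError(f"Unknown unit in {raw!r}")
    return value * UNIT_DAYS[unit]

age_days = [age_to_days(x) for x in ["3 days", "1 wk", "11 D"]]
print(age_days)  # [3.0, 7.0, 11.0]
```

Raising an error on unknown units is a deliberate choice here: a silent guess would reintroduce the inconsistency that the tip tries to avoid.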

#3 May your zeroes not be missing

While a missing value is a value that could not be observed for whatever reason, a 0 is an observed, perfectly valid value. Missing values and zeroes have very different meanings. Avoid using empty cells to imply a value of 0.

E.g. Write a variable as “n_ticks: 8, 4, 0, , 2, , 1, 0…” using empty cells for missing, unobserved values.
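The distinction matters as soon as you compute a summary. Here is a small sketch with made-up counts (the `animal_id` column and the values are assumptions), reading empty cells as missing rather than as 0.

```python
import csv
import io

# Toy data: the empty cell for a4 is a missing value, the 0 for a3 is an observed count.
raw = "animal_id,n_ticks\na1,8\na2,4\na3,0\na4,\na5,2\n"

def parse_count(cell):
    """Return None for an empty (missing) cell, otherwise the integer count."""
    cell = cell.strip()
    return None if cell == "" else int(cell)

rows = list(csv.DictReader(io.StringIO(raw)))
n_ticks = [parse_count(row["n_ticks"]) for row in rows]

# Mean over the observed values only: 14 / 4.
observed = [x for x in n_ticks if x is not None]
mean_observed = sum(observed) / len(observed)

# Wrongly treating the missing value as 0 pulls the mean down: 14 / 5.
mean_if_missing_as_zero = sum(x or 0 for x in n_ticks) / len(n_ticks)

print(mean_observed, mean_if_missing_as_zero)
```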

#4 Separate data collection from data processing

Store collected data in read-only files in an interoperable standard format, and carry out all analyses and calculations (summaries, derived variables, categorisations, etc.) in separate files using the language or tool of your choice.

E.g. Instead of adding a calculated column “avg_parasitic_load” to your Excel data file, keep the observed variables “n_ticks” and “n_horses” in a CSV file and import it into Excel or R for analysis in separate files.

#5 Favour English for recording data and naming variables

Data might (hopefully) be used and re-used in the future by other people for other projects. Writing values and labels in the vehicular language of science improves their (re)usability.

E.g. Write a variable as “day: monday, thursday…” rather than “jour: lundi, jeudi…”

#6 Use the point (.) as decimal separator

The current standards admit both the point and the comma as the decimal marker. However, in an international scientific context, it is convenient to default to the English convention, which is the point (dot). Set the locale configuration of your operating system and data analysis software accordingly.

E.g. Write a value as 3.14159 rather than 3,14159.

#7 Use validation tools for data entry

Software or services providing user interfaces for data entry minimise redundancy and mistakes. This saves time and prevents unnoticed errors that could affect the results.

E.g. Software such as REDCap or KoboToolbox. Cirad has a KoboToolbox service at https://kf.cirad.fr/.

#8 Tidy your data

Data are tidy when they are organised into simple, rectangular tables where each variable forms a column, each observation forms a row and each cell is a single measurement. “Tidy datasets are all alike, but every messy dataset is messy in its own way.” (H. Wickham). Storing data in tidy format is easier and safer to understand, to document, to process, to communicate, to share and to collaborate upon.

E.g. If your observations are structured in groups, instead of inserting rows flagging the start of each group or making multiple sub-tables, consider adding a variable “group” with the same value for all observations belonging to the same group.

#9 One table = one entity

It is good practice to store information about different entities in separate data tables. Each row has a unique identifier, which can be used in other tables to represent relationships. This relational structure avoids redundancies and ensures consistency.

E.g. Instead of storing sampling information in a single table with variables sample_id, date, value, animal, age_animal, farm, coordinates, consider using a table samples with variables id, animal_id, date and value; a table animals with variables id, farm_id and age; and a table farms with variables id and coordinates.
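Splitting the data does not prevent you from getting the single wide table back when you need it. As a minimal sketch, the three tables of the example can be created in an in-memory SQLite database and joined on demand; all values below are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE farms   (id TEXT PRIMARY KEY, coordinates TEXT);
CREATE TABLE animals (id TEXT PRIMARY KEY, farm_id TEXT REFERENCES farms(id), age INTEGER);
CREATE TABLE samples (id TEXT PRIMARY KEY, animal_id TEXT REFERENCES animals(id),
                      date TEXT, value REAL);
INSERT INTO farms   VALUES ('f1', '14.69,-17.44');
INSERT INTO animals VALUES ('a1', 'f1', 3), ('a2', 'f1', 5);
INSERT INTO samples VALUES ('s1', 'a1', '2023-01-29', 0.8),
                           ('s2', 'a1', '2023-02-05', 1.1),
                           ('s3', 'a2', '2023-01-29', 0.4);
""")

# Rebuild the wide, one-row-per-sample view with joins on the identifiers.
wide = con.execute("""
    SELECT s.id, s.date, s.value, a.id, a.age, f.id, f.coordinates
    FROM samples AS s
    JOIN animals AS a ON a.id = s.animal_id
    JOIN farms   AS f ON f.id = a.farm_id
    ORDER BY s.id
""").fetchall()

for row in wide:
    print(row)
```

The age of animal a1 and the coordinates of farm f1 are each stored once, so they cannot become inconsistent across samples.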

#10 Use data repositories for scientific data

Storing data in a data-repository has multiple advantages. It provides: a backup copy of your data, which prevents accidental losses; a central authoritative reference for the latest, definitive version of the data, which improves collaboration; version-control functionality; standardised meta-data documentation, making the data more FAIR; a persistent DOI for publication.

E.g. Cirad’s institutional repository is at https://dataverse.cirad.fr/. See the guide How to publish a data set in Dataverse (in French).

#11 Create a data-dictionary

A data-dictionary is a part of the meta-data which itself consists of a data table with information about the variables in the data (possibly from several tables).

E.g. A table with columns for the variable name (as it appears in the data), the corresponding table (sheet or file name), a description of the variable, valid categorical values or numerical ranges, measurement units, a version of the variable name suitable for visualisations (e.g. with spaces, capitalisation, etc.) and any additional notes.

#12 Write dates as YYYY-MM-DD

This is the international standard ISO-8601 format for dates. Using standards improves interoperability. Moreover, lexical and chronological ordering conveniently match when dates are written this way.

E.g. Today’s version of a manuscript, named 2023-01-29_manuscript-wip.docx, will automatically be sorted first (or last) in your file browser, making it easier to spot the latest version in a list.
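The matching of lexical and chronological order is easy to check. In this sketch the file names are hypothetical; the same three dates are written once as YYYY-MM-DD and once as DD-MM-YYYY, and each list is sorted alphabetically, as a file browser would do.

```python
from datetime import date

# Hypothetical file names: the same three dates in two notations.
iso_names = ["2023-01-29_draft.docx", "2022-12-05_draft.docx", "2023-01-03_draft.docx"]
dmy_names = ["29-01-2023_draft.docx", "05-12-2022_draft.docx", "03-01-2023_draft.docx"]

def iso_date(name):
    """Parse the leading YYYY-MM-DD prefix of a file name."""
    return date.fromisoformat(name[:10])

def dmy_date(name):
    """Parse the leading DD-MM-YYYY prefix of a file name."""
    day, month, year = (int(part) for part in name[:10].split("-"))
    return date(year, month, day)

# Plain alphabetical sorting.
iso_sorted = sorted(iso_names)
dmy_sorted = sorted(dmy_names)

# Compare with the true chronological order.
iso_is_chronological = iso_sorted == sorted(iso_names, key=iso_date)
dmy_is_chronological = dmy_sorted == sorted(dmy_names, key=dmy_date)

print(iso_is_chronological, dmy_is_chronological)  # True False
```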

#13 Fill in all cells

Avoid leaving empty cells to imply a long series of the same value. This might be useful for data-entry or data-visualisation, but it is a very bad idea for data-storage, as it makes processing difficult and error-prone. Each row in a table should be self-sufficient and independent of its position in the table. Empty cells might still be used for missing values, if duly documented.

E.g. Even if many observations were performed on the same date, it is best to write the date in every row, rather than only in the first row of each block.
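An existing sheet that follows the "first row of each block" habit can be repaired with a fill-down step before storage. This is a minimal sketch with made-up rows; `fill_down` is a hypothetical helper, and it assumes that every empty cell in the column really means "same as above" and not a missing value.

```python
# Hypothetical legacy sheet: the date is only given on the first row of each block.
rows = [
    {"date": "2023-01-29", "animal_id": "a1", "n_ticks": 8},
    {"date": "",           "animal_id": "a2", "n_ticks": 4},
    {"date": "",           "animal_id": "a3", "n_ticks": 0},
    {"date": "2023-02-05", "animal_id": "a1", "n_ticks": 2},
    {"date": "",           "animal_id": "a2", "n_ticks": 1},
]

def fill_down(rows, column):
    """Return copies of the rows with empty cells replaced by the last value seen above."""
    filled, last = [], None
    for row in rows:
        if row[column] != "":
            last = row[column]
        elif last is None:
            raise ValueError(f"First row has no value for {column!r}")
        filled.append({**row, column: last})
    return filled

complete = fill_down(rows, "date")
dates = [row["date"] for row in complete]
print(dates)
```

Once filled, the rows can be sorted or filtered in any order without losing their date.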

#14 Use consistent codes for categorical variables

Categorical variables such as country or sex contain values from a specific set of alternative character strings. Manually typing these values during data entry quickly leads to different spellings, capitalisations, leading or trailing spaces, etc. that cause the same values to be interpreted as different in automated processing. Make sure that the codes are used consistently (e.g. by using data-entry tools, stat-tip #7) and explicitly document the possible values of a categorical variable in a data-dictionary (stat-tip #11).

E.g. Instead of using male, mâle, Male, m or M interchangeably in a variable sex, choose one code and use it consistently.
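When the damage is already done, the variants can be mapped back to the documented codes. The sketch below assumes that the data-dictionary defines the codes "M" and "F"; the `SEX_CODES` mapping and the list of spellings are illustrative assumptions.

```python
# Hypothetical mapping from observed spellings to the documented codes "M" and "F".
SEX_CODES = {"male": "M", "mâle": "M", "m": "M", "female": "F", "femelle": "F", "f": "F"}

def normalise_sex(raw):
    """Map a free-text value to its documented code, ignoring case and outer spaces."""
    key = raw.strip().lower()
    if key not in SEX_CODES:
        # Undocumented values are reported instead of being silently kept.
        raise ValueError(f"Undocumented value for sex: {raw!r}")
    return SEX_CODES[key]

raw_values = ["male", "mâle", "Male ", " m", "M", "F", "femelle"]
clean_values = [normalise_sex(v) for v in raw_values]
print(clean_values)  # ['M', 'M', 'M', 'M', 'M', 'F', 'F']
```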

#15 Choose short and meaningful names

Quoting Phil Karlton, “There are only two hard things in Computer Science: cache invalidation and naming things”. Assigning names (to files, variables, functions, objects, etc.) can mistakenly be seen as a necessary but trivial and inconsequential activity. However, it has a great impact on how easily the nature, meaning, content and logic of the objects are grasped. Bad names entail a higher chance of errors and misinterpretations, and more cognitive effort and time than necessary. It is well worth spending some time choosing concise, precise and sufficiently descriptive names. Avoid spaces and special characters (e.g. accents) to facilitate processing.

E.g. 2023-01-29_manuscript_ppr-nigeria.docx is a much more descriptive name than the generic manuscript.docx.

#16 Encode all relevant information in a specific variable

Using the background colour of cells, asterisks or notes in parentheses next to the cell value might be practical for visualisation. For data storage and processing, it is better to encode this information in its own variable (cf. stat-tip #8).

E.g. Instead of using a red background for observations that are doubtful or problematic, include an additional categorical variable to indicate the reliability of the measurement.

#17 Correlation does not imply causation, but causation does not imply correlation either

It is widely recognised that a correlation pattern among variables does not imply a causal relationship. There can be a confounding factor, reverse causality or even a spurious correlation. Nevertheless, this mistaken interpretation is still quite prevalent in the literature. What is even less recognised is the converse: a causal relationship need not be reflected in a correlation, or in a more general association. That is, the absence of association does not imply the absence of a causal relationship either.

E.g. The internal body temperature in mammals is directly influenced by external temperature. However, external temperature also triggers the self-regulatory system that keeps the body temperature constant and seemingly independent from the external conditions.

#18 Statistical significance and scientific relevance are different things

While “significant” is synonymous with “important” or “meaningful” in colloquial terms, it has a very distinct and specific meaning in Statistics. A significant effect results from an observation that would be relatively unexpected in the absence of the effect, given the amount of information available. Thus, with sufficient observations, even negligible, irrelevant effects can be detected. Conversely, lacking sufficient data, important effects can easily go unnoticed.

E.g. An antihypertensive might have been proven to “significantly” reduce blood pressure by 1 mmHg on average, thanks to an extensive clinical trial. However, this effect might well be clinically irrelevant.
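The role of the sample size can be made explicit with a back-of-the-envelope calculation. This sketch uses a simple two-sample z-test under a normal approximation; the between-patient standard deviation of 15 mmHg and the trial sizes are assumptions chosen only for illustration.

```python
from math import erfc, sqrt

def two_sample_z(diff, sd, n_per_arm):
    """z statistic and two-sided p-value for a difference in means (known common SD)."""
    se = sd * sqrt(2 / n_per_arm)
    z = diff / se
    p = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p

# Hypothetical trials: same observed reduction of 1 mmHg, SD of 15 mmHg.
z_small, p_small = two_sample_z(diff=1.0, sd=15.0, n_per_arm=100)
z_large, p_large = two_sample_z(diff=1.0, sd=15.0, n_per_arm=10_000)

print(f"n = 100 per arm:    z = {z_small:.2f}, p = {p_small:.3g}")
print(f"n = 10000 per arm:  z = {z_large:.2f}, p = {p_large:.3g}")
```

The estimated effect is the same 1 mmHg in both cases; only the amount of data changes, and with it the p-value.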

#19 Favour modelling over (multiple) testing

Many common statistical tests (Anova, Ancova, t-test, Wilcoxon, Chi square, Kruskal-Wallis, Mann-Whitney…) are special cases, or close approximations, of the Linear Regression Model. Instead of figuring out which test applies to each situation and running multiple tests to obtain partial answers that neglect all the rest, it is often much better to build a single Regression Model that explains the data, from which multiple questions can be answered at once while controlling for the other factors that influence the outcome.

E.g. It is a really bad idea to conduct multiple univariate tests in order to choose variables to be included in a regression model. They are completely misleading and invalidate the model results. Go for the model directly.
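As an illustration of the equivalence between tests and models, the classical two-sample t-test (with pooled variance) gives exactly the same statistic as the slope of a linear regression on a 0/1 group indicator. The sketch below checks this with made-up measurements, using only elementary formulas.

```python
from math import sqrt
from statistics import mean

group_a = [4.1, 5.0, 5.6, 4.7, 5.2]  # hypothetical measurements
group_b = [5.9, 6.3, 5.4, 6.8, 6.1]

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ss = sum((v - mean(a)) ** 2 for v in a) + sum((v - mean(b)) ** 2 for v in b)
    sp2 = ss / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(sp2 * (1 / na + 1 / nb))

def slope_t(x, y):
    """Least-squares slope of y on x and its t statistic."""
    n, mx, my = len(x), mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, slope / sqrt(sse / (n - 2) / sxx)

# Regression of the outcome on a dummy variable: 0 for group a, 1 for group b.
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

slope, t_reg = slope_t(x, y)
t_test = pooled_t(group_a, group_b)
print(slope, t_reg, t_test)
```

The slope is the difference between the group means, and the two t statistics coincide. The regression formulation has the advantage that further covariates can be added to the same model.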

#20 Use the VIF for assessing multi-collinearity

In order to address multi-collinearity in a regression, people sometimes check the pairwise correlations between variables in order to detect (and eventually remove) variables that are strongly associated. However, variables can be (and often are) orthogonal when considered in subgroups and yet jointly dependent. It is better to use the Variance Inflation Factor (VIF), which measures the extent to which each independent variable is explained by the others.

E.g. Choose a square in a chess board at random. Row and column are independent variables. So are row and colour and also column and colour. Yet, row and column fully determine the colour of the square.
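The chess board claim can be verified by enumerating the 64 squares. This sketch only checks the probabilities of the example (it does not compute a VIF); coding the colour as the parity of row plus column is my own convention.

```python
from fractions import Fraction
from itertools import product

# All 64 squares, with colour coded 0/1 by the parity of row + column.
squares = [(r, c, (r + c) % 2) for r, c in product(range(8), repeat=2)]

def prob(event, given=lambda s: True):
    """Exact conditional probability of `event` among squares satisfying `given`."""
    pool = [s for s in squares if given(s)]
    return Fraction(sum(1 for s in pool if event(s)), len(pool))

def is_dark(square):
    return square[2] == 1

# Knowing only the row (or only the column) tells nothing about the colour...
p_dark = prob(is_dark)
p_dark_by_row = {r: prob(is_dark, lambda s, r=r: s[0] == r) for r in range(8)}
p_dark_by_col = {c: prob(is_dark, lambda s, c=c: s[1] == c) for c in range(8)}

# ...but row and column together determine it completely.
p_dark_by_square = {
    (r, c): prob(is_dark, lambda s, r=r, c=c: s[:2] == (r, c))
    for r, c in product(range(8), repeat=2)
}

print(p_dark, set(p_dark_by_row.values()), set(p_dark_by_square.values()))
```

Every pairwise check returns a probability of one half, whereas conditioning on both row and column leaves only probabilities of 0 or 1.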

#21 The “best” predictive and explanatory models are not necessarily the same

Model selection procedures (including variable selection, e.g. using information criteria or cross-validation) optimise the predictive performance of the model, that is, the accuracy with which the model predicts new outcomes. This is a purely predictive goal, which is not necessarily aligned with the true underlying process. If you are interested in understanding the process, or the actual influence of some variable (e.g. a risk factor) on an outcome, you need to consider causal relationships that might be impossible to discern using observed data alone.

E.g. Suppose that you are interested in the effect of rainfall on mosquito abundance. This effect is indirect: rainfall increases the surface of water ponds, which in turn provide breeding sites for insects. For predictive purposes, you would only use the size of water ponds as an explanatory variable, since rainfall does not add any further information. Yet, for quantifying the target effect, you need to avoid adjusting for the size of water ponds by excluding that variable from the model.

#22 Separate exploratory data analysis from frequentist inference

Exploratory techniques such as variable selection or the assessment of linearity and distributional assumptions should be performed on a different data set than the one used for confirmatory data analysis. Frequentist p-values and confidence intervals are otherwise invalid, since they assume that the model is stated in advance. In contrast, Bayesian methods do not suffer from this constraint.

E.g. For any response variable of interest, you will eventually find a significant effect if you ‘explore’ sufficiently many variables, even if none of them has any real effect. ‘Significance’ is meaningless in this context.
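How quickly this happens can be quantified under a simplifying assumption: if the explored variables have no real effect and the tests are independent, each test has a 5% chance of being declared significant. The sketch below computes the probability of at least one such false positive; the independence of the tests is an assumption made for the sake of the calculation.

```python
alpha = 0.05  # conventional significance threshold

def prob_any_false_positive(n_tests, alpha=alpha):
    """P(at least one 'significant' result) among independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 20, 100):
    print(n, round(prob_any_false_positive(n), 3))
```

With 20 explored variables, the chance of finding at least one "significant" effect by luck alone is already well above one half.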

#23 The difference between “significant” and “not significant” is not itself statistically significant

When comparing quantities between groups or changes in time, evaluate the difference directly, rather than comparing the degree of statistical significance. Even large changes in significance levels can correspond to small, nonsignificant changes in the underlying quantities.

E.g. Consider two groups with effect estimates and standard errors of 25 ± 10 and 10 ± 10. The first effect is statistically significant, being 2.5 standard errors away from 0 (two-sided p ≈ 0.012), and the second is not at all statistically significant, being only one standard error away from 0. It would be tempting to conclude that there is a large difference between the two groups. However, the difference is not even close to being statistically significant: the estimated difference is 15, with a standard error of \(\sqrt{10^2 + 10^2} \approx 14\).
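The arithmetic of this example can be reproduced in a few lines. The sketch assumes normally distributed, independent estimates and uses two-sided p-values.

```python
from math import erfc, sqrt

def z_and_p(estimate, se):
    """z statistic and two-sided normal p-value of an estimate."""
    z = estimate / se
    return z, erfc(abs(z) / sqrt(2))

est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

z_a, p_a = z_and_p(est_a, se_a)  # z = 2.5
z_b, p_b = z_and_p(est_b, se_b)  # z = 1.0

# Compare the groups directly: difference of the estimates and its standard error
# (assuming the two estimates are independent).
diff = est_a - est_b
se_diff = sqrt(se_a**2 + se_b**2)
z_diff, p_diff = z_and_p(diff, se_diff)

print(f"group A: z = {z_a:.2f}, p = {p_a:.3f}")
print(f"group B: z = {z_b:.2f}, p = {p_b:.3f}")
print(f"difference: {diff:.0f} ± {se_diff:.1f}, z = {z_diff:.2f}, p = {p_diff:.3f}")
```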

#24 Leverage prior information

More often than not, you have more or less accurate expectations about the size of an effect or the variability of some quantity, either from previously published studies, from analogies with similar situations or simply from common sense. You can improve your statistical inferences by encoding this knowledge into prior distributions for the appropriate parameters of your model, and fitting the model to the observed data using Bayesian methods. This is especially beneficial when sample sizes are small, variations are large, models are complex or observations are unexpected.

E.g. You observe 2 positives in a sample of 3 subjects (i.e. a 67% positive rate), while your expectation for the positive rate would be around 10%. A simple significance test rejects this hypothesis, whereas a Bayesian analysis shows that, in this low-power situation, the significant result is most likely a false positive.
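A rough version of both calculations fits in a few lines. The significance test below is a one-sided exact binomial test. For the Bayesian side, I assume a conjugate Beta(2, 18) prior, which has a mean of 10% and carries about as much weight as 20 previous observations; this prior is an illustrative choice, not a recommendation.

```python
from math import comb

n, k = 3, 2   # 2 positives out of 3 subjects
p0 = 0.10     # expected positive rate

# One-sided exact binomial test of H0: rate = 10%, i.e. P(X >= 2 | n = 3, p = 0.1).
p_value = sum(comb(n, x) * p0**x * (1 - p0) ** (n - x) for x in range(k, n + 1))

# Bayesian alternative with a conjugate Beta(a, b) prior.
# Beta(2, 18) is an illustrative assumption: prior mean 10%.
a, b = 2, 18
a_post, b_post = a + k, b + (n - k)  # posterior is Beta(4, 19)
prior_mean = a / (a + b)
posterior_mean = a_post / (a_post + b_post)

print(f"p-value: {p_value:.3f}")
print(f"prior mean: {prior_mean:.3f}, posterior mean: {posterior_mean:.3f}")
```

The test rejects the 10% hypothesis at the 5% level, while the posterior mean only moves from 10% to about 17%, far from the observed 67%. A weaker or stronger prior would shift this figure accordingly.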

#25 Embrace the continuity of continuous variables

Binning continuous variables into 2 or more categories may simplify modelling and interpretation, but it is completely unnecessary and creates more problems than it solves. Dichotomisation leads to a considerable loss of information (or power) and incomplete correction for confounding factors. The determination of cut-off points introduces an additional source of uncertainty and can lead to serious biases. Continuous variables are most informative as they are, unless the categories have a pre-established theoretical justification and interest.

E.g. If you are studying the impact of an intervention over time, it is better to model continuous non-linear effects than to estimate the average effects on weeks 1, 2,…

#26 Centre and scale continuous predictors in regression models

It is generally a good idea to centre and scale the predictor variables used in regression models. In a linear predictor of the form \(\eta = \alpha + \beta X_c\), where \(X_c = X - \bar X\), the intercept \(\alpha\) represents the value of the linear predictor at the average value of the exposure \(\bar X\). This is easier to interpret and can also have numerical advantages. Moreover, scaling the predictor (e.g. dividing by its standard deviation) allows interpreting the coefficient \(\beta\) in terms of the impact of a 1 standard deviation increase of \(X\) on the linear predictor, which is especially relevant for comparing the relative effects of multiple predictors with different units.

E.g. In a linear regression with animal weight as a predictor, the intercept represents the expected outcome for an individual of 0 kg. Using z-scores instead (centring and scaling), the intercept is the expected outcome for an individual of average weight, and the coefficient represents the increase in expected outcome for an individual one standard deviation heavier than average. This roughly reflects a typical difference between the average individual and a randomly drawn observation.
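Computing z-scores takes one line once the mean and standard deviation are known. A minimal sketch with made-up weights:

```python
from statistics import mean, stdev

weight_kg = [410.0, 455.0, 480.0, 520.0, 535.0]  # hypothetical animal weights

# Keep the centre and scale: they are needed to transform new observations
# and to translate coefficients back to kilograms.
centre, scale = mean(weight_kg), stdev(weight_kg)
weight_z = [(w - centre) / scale for w in weight_kg]

print(centre, round(scale, 2))
print([round(z, 2) for z in weight_z])
```

The transformed variable has mean 0 and standard deviation 1, so a coefficient estimated on `weight_z` refers to a one standard deviation difference in weight.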

#27 Use logarithmic transformations of positive variables to overcome additivity and linearity assumptions

It commonly makes sense to take the logarithm of outcomes that are all-positive. A linear model on the logarithmic scale corresponds to a multiplicative model on the original scale. This allows expressing a more general class of relationships between exposures and outcome.

E.g. In a linear regression of the form \(Y = \alpha + \beta\, X + \varepsilon\), a unit increase in \(X\) is associated with an increase of \(\beta\) units in \(Y\) (additive effect). If instead you use \(\log(Y)\) as the outcome, a unit increase in \(X\) is associated with a \(100\,(\exp(\beta) - 1)\) percent increase in \(Y\) (multiplicative effect), which is approximately \(100\,\beta\) percent when \(\beta\) is small. If \(\log(X)\) is used as the exposure, a one percent increase in \(X\) is associated with an increase of approximately \(\beta/100\) units in \(Y\). If both the exposure and the outcome are log-transformed, a one percent increase in \(X\) is associated with an increase of approximately \(\beta\) percent in \(Y\).
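These interpretations can be checked numerically. In a model for \(\log(Y)\), a unit increase in \(X\) multiplies \(Y\) by \(\exp(\beta)\), so the exact percent change is \(100\,(\exp(\beta) - 1)\); the other rules are first-order approximations. The sketch below compares exact and approximate values for an arbitrary coefficient of 0.05 (an assumption chosen for illustration).

```python
from math import exp, log

beta = 0.05  # hypothetical regression coefficient

# log(Y) ~ X: one more unit of X multiplies Y by exp(beta).
pct_change_exact = 100 * (exp(beta) - 1)
pct_change_approx = 100 * beta

# Y ~ log(X): a 1% increase in X adds beta * log(1.01) to Y, close to beta / 100.
unit_change_exact = beta * log(1.01)
unit_change_approx = beta / 100

# log(Y) ~ log(X): a 1% increase in X multiplies Y by 1.01 ** beta,
# i.e. a change of roughly beta percent.
elasticity_exact = 100 * (1.01**beta - 1)
elasticity_approx = beta

print(f"log-linear: {pct_change_exact:.3f}% vs {pct_change_approx:.3f}%")
print(f"linear-log: {unit_change_exact:.6f} vs {unit_change_approx:.6f}")
print(f"log-log:    {elasticity_exact:.4f}% vs {elasticity_approx:.4f}%")
```

The approximations are good for small coefficients and small relative changes, and deteriorate as either grows.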

#28 Inferences about individuals drawn from aggregated data are hazardous

Modelling rates, averages or totals over groups of individuals may reveal statistical associations between exposures and outcomes. But neither these associations nor the conclusions derived from them necessarily apply at the individual level, as they are subject to “aggregation bias”, a consequence of the so-called “Modifiable Areal Unit Problem”. Extrapolating inferences across different levels of aggregation is known as the “Ecological Fallacy”.

E.g. In 19th century Europe, suicide rates were higher in countries that were more heavily Protestant. The conclusion that the social conditions of Protestantism promoted suicide is incorrect, though. First, Protestant and Catholic countries differed in many ways besides religion (confounding). But more specifically, the available data were at the national level, and there is no evidence linking the individual suicides to any particular religious faith.

References

Addicott, Ethan T, Eli P Fenichel, Mark A Bradford, Malin L Pinsky, and Stephen A Wood. ‘Toward an Improved Understanding of Causation in the Ecological Sciences’. Frontiers in Ecology and the Environment 20, no. 8 (October 2022): 474–80. https://doi.org/10.1002/fee.2530

Altman, Naomi, and Martin Krzywinski. ‘P Values and the Search for Significance’. Nature Methods 14, no. 1 (1 January 2017): 3–4. https://doi.org/10.1038/nmeth.4120

Broman, Karl W., and Kara H. Woo. (2018). Data Organization in Spreadsheets. The American Statistician 72 (1): 2-10. https://doi.org/10.1080/00031305.2017.1375989

Bryan, Jenny (2015). Naming things. Reproducible Science Workshop. https://speakerdeck.com/jennybc/how-to-name-files

Fortuno, Sophie (2018). Dataverse - Guide Utilisateur V1. http://agritrop.cirad.fr/587424/

Gelman, Andrew, and Hal Stern (2006). ‘The Difference between “Significant” and “Not Significant” Is Not Itself Statistically Significant’. The American Statistician 60, no. 4: 328–31. https://doi.org/10.1198/000313006x152649.

Grolemund, G & Wickham, H (2016). R for Data Science. https://r4ds.had.co.nz

Hoyt, Peter R., Christie Bahlai, Tracy K. Teal (Eds.), Erin Alison Becker, Aleksandra Pawlik, Peter Hoyt, Francois Michonneau, Christie Bahlai, Toby Reiter, et al. (2019, July 5). datacarpentry/spreadsheet-ecology-lesson: Data Carpentry: Data Organization in Spreadsheets for Ecologists, June 2019 (Version v2019.06.2). Zenodo. http://doi.org/10.5281/zenodo.3269869

Lin, Iyar. ‘“Correlation Does Not Imply Causation”. So What Does?’ Just be-cause, February 2019. https://iyarlin.github.io/2019/02/08/correlation-is-not-causation-so-what-is/.

Lindeløv, Jonas K. (2019). Common statistical tests are linear models (or: how to teach stats). https://lindeloev.github.io/tests-as-linear/

Lowndes, Julie and Allison Horst (2020). Tidy data for efficiency, reproducibility, and collaboration. Openscapes blog. https://www.openscapes.org/blog/2020/10/12/tidy-data/

Makin, Tamar R, and Jean-Jacques Orban de Xivry (2019). ‘Ten Common Statistical Mistakes to Watch out for When Writing or Reviewing a Manuscript’. ELife 8: e48175. https://doi.org/10.7554/eLife.48175.

Munroe, Randall (2011). ‘Significant’. xkcd. https://xkcd.com/882/.

Siegfried, Tom (2010). ‘Odds Are, It’s Wrong: Science Fails to Face the Shortcomings of Statistics’. Science News 177, no. 7: 26–29. https://doi.org/10.1002/scin.5591770721.

Wickham, Hadley (2014). Tidy Data. Journal of Statistical Software 59 (10). https://www.jstatsoft.org/v59/i10/

Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. ‘A Protocol for Data Exploration to Avoid Common Statistical Problems’. Methods in Ecology and Evolution 1, no. 1 (2010): 3–14. https://doi.org/10.1111/j.2041-210X.2009.00001.x

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.