Principles and practices for managing and structuring data

Facundo Muñoz

Data storage

Some common issues

  • I have several copies of the data but I’m not sure which is the latest version.

  • I have the data, but I’m not sure I have the latest version.

  • I have a data file but I can’t remember what it contains.

  • I have data, but I can’t find it.

  • I don’t know how to interpret some of the variables.

  • I have corrected some errors and have been sent a new version based on the original data.

Wasted time

Unnoticed errors

Some principles

  1. Avoid duplication of data

  2. Manage versions.

  3. Document data (metadata)

  4. Adopt a filename convention

Start a document of guidelines to follow

Spreadsheets

Combine:

  • data storage

  • data entry

  • visualisation (tables, format)

  • analysis (formulas, conditions, results, summaries, etc.)

  • figures

The requirements for data entry, storage and visualisation are fundamentally different

Horror stories

  • MS Excel interprets dates and stores them internally as a number… with different conventions for Mac and Windows

  • MS Excel automatically interprets certain text as dates. E.g. the symbol for the gene “Oct-4” can be overwritten without notification.

    A 2016 study found this type of error in about 20% of papers with supplementary Excel gene lists.
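
To make the date problem concrete, here is a minimal sketch of how the same stored number maps to two different dates. It assumes the two conventions are the ones commonly called the 1900 and 1904 date systems; the function name `excel_serial_to_date` is made up for this illustration and is not part of any library.

```python
from datetime import date, timedelta

# Day zero of each date system. For the 1900 system we use 1899-12-30
# because Excel counts a non-existent 1900-02-29; this offset is therefore
# only valid for serial numbers from 61 (1900-03-01) onwards.
EPOCH_1900 = date(1899, 12, 30)
EPOCH_1904 = date(1904, 1, 1)


def excel_serial_to_date(serial, date_system=1900):
    """Convert a whole-day Excel serial number to a calendar date."""
    if date_system == 1900:
        if serial < 61:
            raise ValueError("serials below 61 need special handling")
        return EPOCH_1900 + timedelta(days=serial)
    if date_system == 1904:
        return EPOCH_1904 + timedelta(days=serial)
    raise ValueError("date_system must be 1900 or 1904")


serial = 43831  # the number actually stored in the cell
as_1900 = excel_serial_to_date(serial, 1900)
as_1904 = excel_serial_to_date(serial, 1904)
print(as_1900.isoformat())        # 2020-01-01
print(as_1904.isoformat())        # 2024-01-02
print((as_1904 - as_1900).days)   # 1462
```

The same cell value lands four years and a day apart depending on the convention, which is why dates are safer stored as explicit YYYY-MM-DD text.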

You will use the tools you master,

not necessarily those you need.

You can use them, but think about the principles of data management

Naming files (and variables)

There are only two hard things in Computer Science: cache invalidation and naming things.

Phil Karlton

NO

myabstract.docx
Jane's Filenames Use Spaces and Punctuation.xlsx
figure I.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt

Yes

2014-06-08_abstract-for-sla.docx
janes-filenames-are-getting-better.xlsx
fig01_scatterplot-talk-length-vs-interest.png
fig02_histogram-talk-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt

3 principles for file names

  1. Machine-readable

  2. Human-readable

  3. Works well with display order

ISO 8601 standard date format
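
A small sketch of what "machine-readable" can mean in practice. It assumes a hypothetical convention of the form `YYYY-MM-DD_slug.ext`, modelled on the "Yes" examples; the pattern and the function name `parse_name` are illustrative choices, not a standard.

```python
import re
from datetime import date

# Hypothetical convention: ISO 8601 date, underscore, lowercase slug, extension.
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<slug>[a-z0-9-]+)\.(?P<ext>[a-z0-9]+)$"
)


def parse_name(filename):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return {
        "date": date.fromisoformat(match["date"]),
        "slug": match["slug"],
        "ext": match["ext"],
    }


names = [
    "2014-06-08_abstract-for-sla.docx",
    "1986-01-28_raw-data-from-challenger-o-rings.txt",
]
parsed = parse_name(names[0])
print(parsed["date"], parsed["slug"])
print(parse_name("fig 2.png"))  # None: spaces and no date
print(sorted(names))            # alphabetical order is also chronological
```

Because the date comes first and is zero-padded, plain alphabetical sorting gives chronological order, which covers the third principle as well.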

Excellent file names

Tools to help you develop a file naming convention tailored to your needs:

Organising files

datacarpentry

My basic project structure

.
├── data/
├── doc/
├── reports/
├── src/
└── Readme.md

Principles

  1. I never modify files in data.

  2. All documentation and reading material goes in doc (e.g. description of data, articles, etc.)

  3. The analysis work takes place in src.

  4. The results go in reports.
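
The structure above can be set up by hand, or with a few lines of code. This is only a sketch: the function name `create_project` and the placeholder Readme text are assumptions for the example, and the demo runs in a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

# Directory names taken from the project structure above.
SUBDIRS = ("data", "doc", "reports", "src")


def create_project(root):
    """Create the project skeleton under root; existing files are left alone."""
    root = Path(root)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "Readme.md"
    if not readme.exists():
        # Placeholder content, to be replaced by a real description.
        readme.write_text("# Project title\n\nBrief description.\n", encoding="utf-8")
    return root


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "my-project")
    created = sorted(p.name for p in project.iterdir())
    print(created)
```

Running the function twice is harmless: directories are reused and an existing Readme.md is never overwritten.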

Structuring data

Principles

Facilitate:

  1. Importing and processing of data using different methods and tools

  2. Understanding the structure of data

Some bad habits

Using multiple tables

Encoding information as cell formatting

Empty cells

Non-rectangular structures

Good practices

  • Maintain consistency (coding, capitalisation, format, etc.)

  • Use the ISO 8601 standard for dates (YYYY-MM-DD)

  • Describe variables (metadata)

  • Separate data from derived calculations

  • Names: meaningful, short and descriptive

  • Back up

  • Store data in text files
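
Some of these practices can be checked automatically once the data live in a text file. The sketch below scans a small CSV for empty cells and for dates that are not written as YYYY-MM-DD. The sample data, the column names and the function name `find_problems` are all invented for the illustration.

```python
import csv
import io
import re

# Invented sample: one badly formatted date and one empty cell.
RAW = """site,sampling_date,weight_g
A,2021-03-04,12.5
B,04/03/2021,11.0
C,2021-03-06,
"""

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def find_problems(text, date_column):
    """Return (line number, column, message) for each suspicious cell."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column, value in row.items():
            if value is None or value.strip() == "":
                problems.append((line_no, column, "empty cell"))
        value = (row.get(date_column) or "").strip()
        if value and not ISO_DATE.fullmatch(value):
            problems.append((line_no, date_column, "not a YYYY-MM-DD date"))
    return problems


problems = find_problems(RAW, "sampling_date")
for problem in problems:
    print(problem)
```

The check only looks at the shape of the date, not whether it is a real calendar day; it is meant as a first filter, not a full validation.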

Tidy data
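
One common reading of tidy data is: one variable per column, one observation per row. As a rough sketch under that reading, the code below reshapes a table where years are spread across columns into one row per site and year. The data and the function name `to_long` are made up for the example.

```python
# Invented "wide" table: the year is hidden in the column names.
wide = [
    {"site": "A", "2020": 3, "2021": 5},
    {"site": "B", "2020": 4, "2021": 6},
]


def to_long(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return long_rows


tidy = to_long(wide, "site", "year", "count")
for row in tidy:
    print(row)
```

In the reshaped table the year is an explicit variable, so it can be filtered, grouped or plotted like any other column.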

The description of a data set

Principles

  1. To be able to transfer data with all the information needed to work with it.

  2. To avoid errors in interpretation

Your primary collaborator is your past self,

but he doesn’t answer emails

Metadata

  • Readme file

    • Brief description of the project, data source and collection methods, context, objectives, references, contacts, etc.
  • Description of variables (data dictionary)

    • The name of the variable as it is used
    • Version of the name adapted for display
    • Type of variable (categorical, quantitative, etc.)
    • Units of measurement
    • Possible range of variation, possible values
    • Description of the variable

Data dictionary

Itself a data set!!
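
Since the data dictionary is itself a data set, it can be stored as a small text file and used by code. The sketch below keeps a dictionary with the fields listed above in CSV form and checks a record against it. The variables, ranges and function names (`load_dictionary`, `check_record`) are assumptions for the illustration.

```python
import csv
import io

# Invented data dictionary: one row per variable.
DICTIONARY_CSV = """name,label,type,unit,min,max,description
site,Site,categorical,,,,Sampling site code
weight_g,Weight (g),quantitative,g,0,500,Body weight at capture
"""


def load_dictionary(text):
    """Index the dictionary rows by variable name."""
    return {row["name"]: row for row in csv.DictReader(io.StringIO(text))}


def check_record(record, dictionary):
    """List the ways a record disagrees with the data dictionary."""
    problems = []
    for name in record:
        if name not in dictionary:
            problems.append(f"{name}: not described in the data dictionary")
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"{name}: missing from the record")
            continue
        if spec["type"] == "quantitative" and spec["min"] and spec["max"]:
            value = float(record[name])
            if not float(spec["min"]) <= value <= float(spec["max"]):
                problems.append(f"{name}: {value} outside [{spec['min']}, {spec['max']}]")
    return problems


dictionary = load_dictionary(DICTIONARY_CSV)
print(check_record({"site": "A", "weight_g": "12.5"}, dictionary))
print(check_record({"site": "A", "weight_g": "900"}, dictionary))
```

The same file that documents the variables for a human reader then doubles as a simple validation rule set.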

Conclusions

  • Very vague suggestions, several choices and options

  • No matter what system or tools:

    • Associated text files

    • Repository with integrated metadata (e.g. Dataverse)

    • Specific formats

    • Data Management Plan

  • Choose one and respect the principles

Thank you