I have several copies of the data but I’m not sure which is the latest version.
I have the data, but I’m not sure I have the latest version.
I have a data file but I can’t remember what it contains.
I have data, but I can’t find it.
I don’t know how to interpret some of the variables.
I have corrected some errors and have been sent a new version based on the original data.
Avoid duplication of data
Manage versions
Document data (metadata)
Adopt a filename convention
Start a document of guidelines to follow
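As one possible way to act on the first point, the sketch below groups files with byte-identical contents so that redundant copies can be spotted. The function names `file_digest` and `find_duplicates` are hypothetical, chosen for this illustration; only the Python standard library is assumed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(folder: Path) -> list[list[Path]]:
    """Group files under `folder` whose contents are byte-identical."""
    groups = defaultdict(list)
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in groups.values() if len(paths) > 1]
```

This only detects exact copies. It cannot tell which of two different versions is the latest, which is why versioning and documentation still matter.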
Spreadsheets combine:
data storage
data entry
visualisation (tables, format)
analysis (formulas, conditions, results, summaries, etc.)
figures
MS Excel interprets dates and stores them internally as a number, with different conventions for Mac and Windows.
MS Excel automatically interprets certain text as dates. For example, the gene symbol “Oct-4” can be overwritten without notification.
A 2016 study found this type of error in about 20% of papers with supplementary Excel gene lists.
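To make the date issue concrete, here is a minimal sketch of how the same stored number maps to two different calendar dates. It assumes the two Excel date systems commonly called "1900" (the Windows default) and "1904" (historically the default on Mac). The function name `excel_serial_to_date` is hypothetical.

```python
from datetime import date, timedelta

# Day zero of each Excel date system. The 1900 system uses 1899-12-30 so that
# serials after Excel's fictitious 1900-02-29 map to the right calendar day.
EPOCHS = {1900: date(1899, 12, 30), 1904: date(1904, 1, 1)}


def excel_serial_to_date(serial: int, date_system: int = 1900) -> date:
    """Convert a whole-day Excel serial number to a calendar date."""
    if date_system not in EPOCHS:
        raise ValueError("date_system must be 1900 or 1904")
    if date_system == 1900 and serial < 61:
        # Earlier serials are shifted by the 1900 leap-year quirk; not handled here.
        raise ValueError("serials below 61 are not supported in the 1900 system")
    return EPOCHS[date_system] + timedelta(days=serial)


print(excel_serial_to_date(42005, 1900))  # 2015-01-01
print(excel_serial_to_date(42005, 1904))  # 2019-01-02
```

The two results differ by 1462 days, so a file moved between the two systems without care can silently shift every date by about four years.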
You will use the tools you have mastered,
not necessarily those you need.
You can keep using them, but keep the principles of data management in mind.
There are only two hard things in Computer Science: cache invalidation and naming things.
Phil Karlton
NO
myabstract.docx
Jane's Filenames Use Spaces and Punctuation.xlsx
figure I.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt
YES
2014-06-08_abstract-for-sla.docx
janes-filenames-are-getting-better.xlsx
fig01_scatterplot-talk-length-vs-interest.png
fig02_histogram-talk-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt
Machine-readable
Human-readable
Works well with display order
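The three principles can be partly checked by a script. The sketch below is an assumption about what "machine-readable" might mean in practice, based on the YES examples above: lowercase letters and digits, hyphens between words, underscores between fields, and a single extension. The names `SAFE_NAME`, `is_machine_readable` and `split_fields` are hypothetical.

```python
import re

# Lowercase alphanumeric chunks joined by "-" or "_", followed by one extension.
SAFE_NAME = re.compile(r"[a-z0-9]+(?:[-_][a-z0-9]+)*\.[a-z0-9]+")


def is_machine_readable(filename: str) -> bool:
    """True if the name has no spaces, punctuation or uppercase letters."""
    return SAFE_NAME.fullmatch(filename) is not None


def split_fields(filename: str) -> list[str]:
    """Split the part before the extension into its underscore-separated fields."""
    stem = filename.rsplit(".", 1)[0]
    return stem.split("_")


print(is_machine_readable("fig 2.png"))                     # False
print(is_machine_readable("fig02_histogram-talk-attendance.png"))  # True
print(split_fields("2014-06-08_abstract-for-sla.docx"))     # ['2014-06-08', 'abstract-for-sla']

# Zero-padded numbers keep the display order right; unpadded ones do not.
print(sorted(["fig10.png", "fig2.png"]))    # ['fig10.png', 'fig2.png']
print(sorted(["fig10.png", "fig02.png"]))   # ['fig02.png', 'fig10.png']
```

A check like this cannot judge whether a name is human-readable: `myabstract.docx` passes it even though it says little about the file's content.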
Tools to help you develop a file naming convention tailored to your needs: Data Carpentry
.
├── data/
├── doc/
├── reports/
├── src/
└── Readme.md
I never modify files in data.
All documentation to be read goes in doc (e.g. description of data, articles, etc.).
The analysis work takes place in src.
The results go in reports.
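A layout like this can be created by hand, but a short script makes it repeatable. The following is only a sketch: the function names `create_project` and `protect_raw_data` are hypothetical, and removing write permission is one possible way, among others, to enforce the rule about never modifying raw data.

```python
from pathlib import Path

FOLDERS = ("data", "doc", "reports", "src")


def create_project(root: Path) -> None:
    """Create the folder skeleton and an initial Readme.md under `root`."""
    for name in FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "Readme.md"
    if not readme.exists():
        readme.write_text(f"# {root.name}\n", encoding="utf-8")


def protect_raw_data(root: Path) -> None:
    """Remove write permission from every file in data/ (assumed policy)."""
    for path in (root / "data").rglob("*"):
        if path.is_file():
            # Clear the write bits for user, group and others.
            path.chmod(path.stat().st_mode & ~0o222)
```

Calling `create_project(Path("my-study"))` and then, once the raw files are in place, `protect_raw_data(Path("my-study"))` gives a project where accidental edits to the raw data fail instead of passing unnoticed.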
Importing and processing of data using different methods and tools
Understanding the structure of data
Maintain consistency (coding, capitalisation, format, etc.)
ISO 8601 standard for dates (YYYY-MM-DD)
Describing variables (metadata)
Separating data from derived calculations
Names: meaningful, short and descriptive
Back up
Store data in text files
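Two of these points, ISO 8601 dates and plain text storage, can be sketched together. The example below assumes the input dates arrive in a known format such as day/month/year; the function names and the column names `sample_date` and `temp_c` are hypothetical.

```python
import csv
from datetime import datetime
from pathlib import Path


def to_iso_date(text: str, input_format: str) -> str:
    """Parse a date written in `input_format` and return it as YYYY-MM-DD."""
    return datetime.strptime(text, input_format).date().isoformat()


def write_records(path: Path, records: list[dict], fieldnames: list[str]) -> None:
    """Store records as a UTF-8 CSV text file with a header row."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


print(to_iso_date("28/01/1986", "%d/%m/%Y"))  # 1986-01-28

records = [{"sample_date": to_iso_date("28/01/1986", "%d/%m/%Y"), "temp_c": "2.2"}]
write_records(Path("example.csv"), records, ["sample_date", "temp_c"])
```

Dates written as YYYY-MM-DD are unambiguous across locales and sort chronologically as plain text, which is why they pair well with text files.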
To be able to transfer data with all the information needed to work with it.
To avoid errors in interpretation
Your primary collaborator is your past self,
but he doesn’t answer emails
Readme file
Description of variables (data-dictionary)
Itself a data set!!
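Because a data dictionary is itself a data set, it can be stored as a table and checked automatically. The sketch below assumes a dictionary with a `variable` column listing each documented name; the sample data, the column names and the function `undocumented_variables` are all hypothetical.

```python
import csv
import io

# Hypothetical data file and its data dictionary, both as CSV text.
DATA_CSV = """site_id,sample_date,temp_c
A1,2020-03-01,4.5
"""

DICTIONARY_CSV = """variable,description,unit
site_id,Identifier of the sampling site,
sample_date,Date of sampling (ISO 8601),
"""


def undocumented_variables(data_file, dictionary_file) -> list[str]:
    """Return the data columns that have no entry in the data dictionary."""
    data_columns = next(csv.reader(data_file))  # header row of the data file
    documented = {row["variable"] for row in csv.DictReader(dictionary_file)}
    return [column for column in data_columns if column not in documented]


missing = undocumented_variables(io.StringIO(DATA_CSV), io.StringIO(DICTIONARY_CSV))
print(missing)  # ['temp_c']
```

Running such a check whenever the data changes helps keep the description of variables in step with the data it describes.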
These suggestions are deliberately general: there are several choices and options
No matter what system or tools:
Associated text files
Repository with integrated metadata (e.g. Dataverse)
Specific formats
Data Management Plan
Choose one and respect the principles
Kristin Briney (2020). File naming convention worksheet.
Data Carpentry (2018). Lesson on file organisation.
Data Carpentry (2020). Tidy data for efficiency, reproducibility, and collaboration.
Data Carpentry (2019). Data Organization in Spreadsheets for Ecologists.
Karl W. Broman & Kara H. Woo (2018). Data Organization in Spreadsheets. The American Statistician, 72(1), 2-10.
Hadley Wickham (2014). Tidy Data. Journal of Statistical Software, 59(10).
Thank you