These are general guidelines for preparing tabular data for inclusion in a repository or for sharing it with other researchers, in order to maximize the likelihood of long-term preservation and potential for reuse. Individual repositories may have different or more specific guidelines than those presented here.
Organize tabular data into rows and columns. Each row represents a single record or data point, while columns contain information pertaining to that record. Each record or row in the data set should be uniquely identified by one or more columns in combination.
Tabular data should be "rectangular," with each row having the same number of columns and each column the same number of rows. Fill every cell that could contain data; this matters less for cells used only for comments. For missing data, use the conventions described below.
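To illustrate, the rectangularity and unique-identification rules above can be checked mechanically. A minimal Python sketch using only the standard library (the field names and data are hypothetical):

```python
import csv
import io

# A small illustrative dataset; (site_id, visit_date) together identify a row.
raw = """site_id,visit_date,max_temp_celsius
A1,2023-06-01,21.4
A1,2023-06-02,23.0
B2,2023-06-01,19.8
"""

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

# Rectangular: every row has the same number of columns as the header.
assert all(len(r) == len(header) for r in records)

# Unique identification: the combination (site_id, visit_date) is a key.
keys = [(r[0], r[1]) for r in records]
assert len(keys) == len(set(keys))
```

A repository or collaborator can run the same two checks on any delimited file before accepting it.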
Column headings should be meaningful, but not overly long. Do not duplicate column headings within a file. Assume headings are case-insensitive: do not create two headings that differ only in case. Use only alphanumeric characters, underscores, or hyphens in column headings. Some programs expect the first character to be a letter, so it is good practice to have column headings start with a letter. If possible, indicate units of measurement in the column headings, and also specify measurement units in the metadata.
Use only the first row to identify a column heading. Data import utilities may not properly parse column headings that span more than one row.
Examples of good column headings:
max_temp_celsius - not max temp celsius (includes spaces)
airport_faa_code - not airport/faa code (includes special characters)
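When cleaning an existing file, the heading rules above can be applied programmatically. The following is one possible sketch of a normalization helper (the function name and the exact rules encoded here are illustrative, not a standard utility):

```python
import re

def clean_heading(raw: str) -> str:
    """Normalize a raw column heading to a safe, portable form.

    Lowercases (so case carries no meaning), replaces runs of spaces or
    special characters with underscores, and prefixes a letter when the
    result would not start with one.
    """
    h = re.sub(r"[^A-Za-z0-9_-]+", "_", raw.strip()).strip("_").lower()
    if not h or not h[0].isalpha():
        h = "c_" + h  # illustrative prefix for headings starting with a digit
    return h

print(clean_heading("max temp celsius"))   # max_temp_celsius
print(clean_heading("airport/faa code"))   # airport_faa_code
```

Running every heading through one such function also guarantees there are no duplicates that differ only in spacing or case.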
Data Values and Formatting
Special Types of Data - Date/Time
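For date and time values, a widely used convention is the ISO 8601 standard (YYYY-MM-DD for dates, YYYY-MM-DDThh:mm:ss for timestamps), which sorts correctly as plain text and avoids ambiguity between day-first and month-first readings. A minimal Python sketch:

```python
from datetime import datetime

# Format a timestamp in ISO 8601 so dates sort correctly as text.
observed = datetime(2023, 6, 1, 14, 30)
print(observed.date().isoformat())   # 2023-06-01
print(observed.isoformat())          # 2023-06-01T14:30:00

# Parsing back is unambiguous: 2023-06-01 cannot be misread as January 6.
parsed = datetime.fromisoformat("2023-06-01T14:30:00")
assert parsed == observed
```

Whatever format is chosen, it should be used consistently throughout the file and documented in the metadata.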
Special Types of Data - Missing Data
Use a standard method to identify missing data; do not use zeroes or leave a cell blank. Select a code to identify missing data; using -999 or -9999 is a common convention. Indicate the code for missing data in the metadata. When exporting data to another format, check to ensure that no cells with missing data have zeros or are blank in the resulting file.
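As a sketch of handling a missing-data code on import (the -999 sentinel, field names, and data are illustrative):

```python
import csv
import io

MISSING = "-999"  # missing-data code; document this choice in the metadata

raw = """plot,species_count
1,12
2,-999
3,7
"""

counts = []
for row in csv.DictReader(io.StringIO(raw)):
    value = row["species_count"]
    # Translate the sentinel to None rather than treating it as a real value.
    counts.append(None if value == MISSING else int(value))

# Summary statistics must exclude the sentinel, or -999 would distort them.
present = [c for c in counts if c is not None]
print(sum(present) / len(present))   # 9.5
```

This is exactly the failure mode the guideline warns about: if -999 were read as an ordinary number, the mean would be badly wrong.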
Consider performing basic data quality assurance to detect errors or inconsistencies in data. Here are some common techniques:
Providing summary information about the data and including it in the metadata helps users verify they have an uncorrupted version of the data. This information might include number of columns; max, min, or mean of parameters in data; number of missing values; or total file size.
OpenRefine (formerly Google Refine) is a very useful tool for exploring, cleaning, editing, and transforming data. Advanced operations can be performed on data using GREL (the General Refine Expression Language).
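The summary-information technique above can be sketched with the standard library (the field names and the blank cell are illustrative; the blank is exactly the kind of problem such a check should flag):

```python
import csv
import io
import statistics

raw = """plot,max_temp_celsius
1,21.4
2,
3,19.8
"""

rows = list(csv.DictReader(io.StringIO(raw)))
values = [r["max_temp_celsius"] for r in rows]
present = [float(v) for v in values if v != ""]

# Summary facts a user can later verify against the metadata.
summary = {
    "n_columns": len(rows[0]),
    "n_rows": len(rows),
    "n_missing": len(values) - len(present),
    "min": min(present),
    "max": max(present),
    "mean": round(statistics.mean(present), 2),
}
print(summary)
```

Recording these figures in the metadata lets a downstream user confirm that a downloaded copy is complete and uncorrupted.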
The preceding guidelines have been adapted from several sources, including:
Hook, L.A., Beaty, T.W., Santhana-Vannan, S., Baskaran, L., & Cook, R.B. (2007). Best Practices for Preparing Environmental Data Sets to Share and Archive. http://daac.ornl.gov/PI/bestprac.html
Michener, W.K. & Brunt, J.W. (Eds.). (2000). Ecological Data: Design, Management and Processing.
Inter-university Consortium for Political and Social Research. (2009). Guide to Social Science Data Preparation and Archiving. http://www.icpsr.umich.edu/access/dataprep.pdf
Borer, E.T., Seabloom, E.W., Jones, M.B., & Schildhauer, M. (2009). Some Simple Guidelines for Effective Data Management. Bulletin of the Ecological Society of America, 90(2), 205-214. http://www.nceas.ucsb.edu/files/news/ESAdatamng09.pdf