Best Practices for Data Tables

Introduction

Overview

Teaching: 2 min
Exercises: 0 min
Questions
  • What will we cover this morning?

Objectives

Deep Dive into Data Management

So far we have tackled the general concepts around data management: F.A.I.R concepts and the data life cycle. And where BCO-DMO fits in that picture.

recap

F.A.I.R geared toward data curators rather than individual researchers, and to facilitate machine accessibility of data in addition to human readability. But you should endeavor to use resources that help move your data closer to being FAIR. BCO-DMO engages in all of the FAIR principles.

There are steps individual researchers can take to increase the “fairness” of their data. This will be the focus of today. We will dive deeper into hands-on management of data tables and the general practices of (re)using a data table.

Part 1: Good approaches to Tabular Data Formatting

Spreadsheets are good for data entry. We have a lot of data in spreadsheets. Much of your time as a researcher will be spent in this ‘data wrangling’ stage. It’s not the most fun, but it’s necessary. We’ll teach you how to think about data organization and some practices for more effective data wrangling. After this lesson, you will be able to:

Table of content:

  1. Tabular Data and Spreadsheets
  2. Common formatting mistakes & exercise
  3. Discussion
  4. Dates are special
  5. Exporting data

Part 2: Introduction to Data Reuse, Access and Provenance

After this lesson, you will be able to:

Table of content:

  1. Assessing metadata
  2. Download data
  3. Data manipulation in spreadsheets
  4. What is next?

Key Points


Tabular Data & Spreadsheets

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Why focusing on spreadsheets?

  • What are the drawbacks?

  • What other data types exist?

Objectives
  • Use spreadsheets appropriately in research projects”

Data organization and the data life cycle

Good data organization is the foundation of your research project and should already be taken into account in the acquisition phase of a research project.

data-life-cycle-acquisition

The structure and format of datasets can be planned and described in the proposal stage of a research project and is captured in a document called the data management plan (DMP).

A data management plan (DMP) is a written document that describes the data you expect to acquire/generate during the course of a research project, how you will manage, describe, analyze, and store those data, and what mechanisms you will use at the end of your project to share and preserve your data.

Important to note that this is a living document and can be adjusted as you move forward. But it helps as a framework at the start.

dmp-data-organization

Spreadsheets for Data Entry

Most researchers have data or do data entry in spreadsheets. Spreadsheet programs are very useful graphical interfaces for designing data tables and handling basic data quality control.

Many spreadsheet programs are available. Since most participants utilize Excel as their primary spreadsheet program, this lesson will make use of Excel examples.

data-types

NOTE: Excel is proprietary software, which means that it is owned by a company that is restricting the ways it can be used. A free spreadsheet program that can also be used is LibreOffice. Commands may differ a bit between programs, but the general idea is the same.

data-types

Exercise

  • How many people have used spreadsheets in their class?
  • How many people have accidentally done something that made them frustrated or sad?

Spreadsheets encompass a lot of the things we need to be able to do as researchers. We can use them for:

Exercise

  • What kind of operations do you do in spreadsheets?
  • Which ones do you think spreadsheets are good for?

Problems with Spreadsheets

Spreadsheets are good for data entry, but in reality we tend to use spreadsheet programs for much more than data entry. We use them to create data tables for publications, to generate summary statistics, and make figures.

Generating tables for publications in a spreadsheet is not optimal - often, when formatting a data table for publication, we’re reporting key summary statistics in a way that is not really meant to be read as data, and often involves special formatting (merging cells, creating borders, making it pretty).

publication-layout

The latter two applications, generating statistics and figures, should be used with caution: because of the graphical, drag and drop nature of spreadsheet programs, it can be very difficult, if not impossible, to replicate your steps (much less retrace anyone else’s), particularly if your stats or figures require you to do more complex calculations.

Furthermore, in doing calculations in a spreadsheet, it’s easy to accidentally apply a slightly different formula to multiple adjacent cells. When using a command-line based statistics program like R or SAS, it’s practically impossible to apply a calculation to one observation in your dataset but not another unless you’re doing it on purpose.

datatable-figure

However, there are circumstances where you might want to use a spreadsheet program to produce “quick and dirty” calculations or figures, and data cleaning will help you use some of these features. Data cleaning also puts your data in a better format prior to importation into a statistical analysis program.

Not all data is tabular

The focus of this class is tabular data. However, it is important to note that there are other data types out there such as documents, experimental data, fasta files, spectral counts, code, images, maps.

data-types

Key Points

  • Data organization starts at the sampling phase of a reserach project.

  • Spreadsheets are good for data entry, but we use them for a lot more like formatting tables for publication and figures.

  • Not all data is tabular.


Formatting Data Tables

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • Why is formatting data tables important?

  • Why should I track analysis?

  • How to structure data in spreadsheets?

Objectives
  • Understand difference between spreadsheets and lab notebooks

  • Describe best practices for data entry and formatting in spreadsheets.

  • Apply best practices to arrange variables and observations in a spreadsheet.

Spreadsheets are not lab notebooks

We know now that we can use spreadsheets for data entry. The most common mistake made is treating spreadsheet programs like lab notebooks, that is, relying on context, notes in the margin, spatial layout of data and fields to convey information. As humans, we can (usually) interpret these things, but computers don’t view information the same way, and unless we explain to the computer what every single thing means (and that can be hard!), it will not be able to see how our data fits together.

Using the power of computers, we can manage and analyze data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal).

data-life-cycle-acquisition

Spreadsheet-like lab notebook:

excel-noteook

You can set things up in different ways in spreadsheets, but some of these choices can limit your ability to work with the data in other programs or have the you-of-6-months-from-now or your collaborator work with the data. These are the interoperable and reusable parts of F.AI.I.R data principles.

Note: the best layouts/formats (as well as software and interfaces) for data entry and data analysis might be different. It is important to take this into account, and ideally automate the conversion from one to another.

Keeping track of your analyses

When you’re working with spreadsheets, during data clean up or analyses, it’s very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when Reviewer #3 asks for a different analysis, you should

This might be an example of a spreadsheet setup:

spreadsheet setup

Put these principles in to practice today during your Exercises.

Structuring data in spreadsheets

The cardinal rule of using spreadsheet programs for data is to keep it “tidy”:

  1. Put all your variables in columns - the thing you’re measuring, like ‘weight’ or ‘temperature’.
  2. Put each observation in its own row.
  3. Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data.
  4. Leave the raw data raw - don’t change it!
  5. Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.

For instance, we have data from a survey of small mammals in a desert ecosystem. Different people have gone to the field and entered data into a spreadsheet. They keep track of things like species, plot, weight, sex and date collected.

If they were to keep track of the data like this:

multiple-info example

the problem is that species and sex are in the same field. So, if they wanted to look at all of one species or look at different weight distributions by sex, it would be hard to do this using this data setup. If instead we put sex and species in different columns, you can see that it would be much easier.

Columns for variables and rows for observations

The rule of thumb, when setting up a datasheet, is columns = variables, rows = observations, cells = data (values).

So, instead we should have:

single-info example

Discussion

The dataset that will be used during the next exercise is the Portal Project Teaching Dataset.

These data are observations of a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.

This is a real dataset that has been used in over 100 publications. We’ve simplified it just a little bit for the workshop, but you can download the full dataset and work with it using exactly the same tools we’ll learn about today.

Exercise

We’re going to take a messy version of the survey data and describe how we would clean it up.

  1. Download the data by clicking here to get it from FigShare.
  2. Open up the data in a spreadsheet program.
  3. You can see that there are two tabs. Two field assistants conducted the surveys, one in 2013 and one in 2014, and they both kept track of the data in their own way. Now you’re the person in charge of this project and you want to be able to start analyzing the data.
  4. Identify what is wrong with this spreadsheet. Also discuss the steps you would need to take to clean up the 2013 and 2014 tabs, and to put them all together in one spreadsheet.

Important Do not forget our first piece of advice: to create a new file (or tab) for the cleaned data, never modify your original (raw) data.

After you go through this exercise, we’ll discuss as a group what was wrong with this data and how you would fix it.

Solution

  • Take about 10 minutes to work on this exercise.

Key Points

  • Computers need to be able to understand data tables

  • Never modify your raw data.

  • Keep track of all of the steps you take to clean your data.

  • Organize your data according to tidy data principles.


Discussion Formatting Problems

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What are some common challenges with formatting data in spreadsheets and how can we avoid them?

Objectives
  • Recognize and resolve common spreadsheet formatting problems.

Common Spreadsheet Errors

There are a few potential errors to be on the lookout for in your own data as well as data from collaborators or the Internet. If you are aware of the errors and the possible negative effect on downstream data analysis and result interpretation, it might motivate yourself and your project members to try and avoid them and be part of F.A.I.R. data creation.

Making small changes to the way you format your data in spreadsheets can have a great impact on efficiency and reliability when it comes to data cleaning and analysis.

Using multiple tables

A common strategy is creating multiple data tables within one spreadsheet. This confuses the computer, so don’t do this! When you create multiple tables within one spreadsheet, you’re drawing false associations between things for the computer, which sees each row as an observation. You’re also potentially using the same field name in multiple places, which will make it harder to clean your data up into a usable form. The example below depicts the problem:

multiple tabs

In the example above, the computer will see (for example) row 4 and assume that all columns A-AF refer to the same sample. This row actually represents four distinct samples (sample 1 for each of four different collection dates - May 29th, June 12th, June 19th, and June 26th), as well as some calculated summary statistics (an average (avr) and standard error of measurement (SEM)) for two of those samples. Other rows are similarly problematic.

Using multiple tabs

But what about workbook tabs? That seems like an easy way to organize data, right? Well, yes and no. When you create extra tabs, you fail to allow the computer to see connections in the data that are there (you have to introduce spreadsheet application-specific functions or scripting to ensure this connection). Say, for instance, you make a separate tab for each day you take a measurement.

This isn’t good practice for two reasons: 1) you are more likely to accidentally add inconsistencies to your data if each time you take a measurement, you start recording data in a new tab, and 2) even if you manage to prevent all inconsistencies from creeping in, you will add an extra step for yourself before you analyze the data because you will have to combine these data into a single datatable. You will have to explicitly tell the computer how to combine tabs - and if the tabs are inconsistently formatted, you might even have to do it manually.

The next time you’re entering data, and you go to create another tab or table, ask yourself if you could avoid adding this tab by adding another column to your original spreadsheet. We used multiple tabs in our example of a messy data file, but now you’ve seen how you can reorganize your data to consolidate across tabs.

Your data sheet might get very long over the course of the experiment. This makes it harder to enter data if you can’t see your headers at the top of the spreadsheet. But don’t repeat your header row. These can easily get mixed into the data, leading to problems down the road.

Instead you can freeze the column headers so that they remain visible even when you have a spreadsheet with many rows.

Documentation on how to freeze column headers in MS Excel

Not filling in zeros

It might be that when you’re measuring something, it’s usually a zero, say the number of times a rabbit is observed in the survey. Why bother writing in the number zero in that column, when it’s mostly zeros?

However, there’s a difference between a zero and a blank cell in a spreadsheet. To the computer, a zero is actually data. You measured or counted it. A blank cell means that it wasn’t measured and the computer will interpret it as an unknown value (otherwise known as a null value).

The spreadsheets or statistical programs will likely mis-interpret blank cells that you intend to be zeros. By not entering the value of your observation, you are telling your computer to represent that data as unknown or missing (null). This can cause problems with subsequent calculations or analyses. For example, the average of a set of numbers which includes a single null value is always null (because the computer can’t guess the value of the missing observations). Because of this, it’s very important to record zeros as zeros and truly missing data as nulls.

Using problematic null values

Example: using -999 or other numerical values (or zero) to represent missing data.

Solutions:

There are a few reasons why null values get represented differently within a dataset. Sometimes confusing null values are automatically recorded from the measuring device. If that’s the case, there’s not much you can do, but it can be addressed in data cleaning with a tool like OpenRefine before analysis. Other times different null values are used to convey different reasons why the data isn’t there. This is important information to capture, but is in effect using one column to capture two pieces of information. Like for using formatting to convey information it would be good here to create a new column like ‘data_missing’ and use that column to capture the different reasons.

Whatever the reason, it’s a problem if unknown or missing data is recorded as -999, 999, or 0. Many statistical programs will not recognize that these are intended to represent missing (null) values. How these values are interpreted will depend on the software you use to analyze your data. It is essential to use a clearly defined and consistent null indicator. Blanks (most applications) and NA (for R) are good choices. White et al, 2013, explain good choices for indicating null values for different software applications in their article: Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution.

White et al.

Using formatting to convey information

Example: highlighting cells, rows or columns that should be excluded from an analysis, leaving blank rows to indicate separations in data.

formatting

Solution: create a new field to encode which data should be excluded.

good formatting

Using formatting to make the data sheet look pretty

Example: merging cells.

Solution: If you’re not careful, formatting a worksheet to be more aesthetically pleasing can compromise your computer’s ability to see associations in the data. Merged cells will make your data unreadable by statistics software. Consider restructuring your data in such a way that you will not need to merge cells to organize your data.

Placing comments or units in cells

Example: Your data was collected, in part, by a summer student who you later found out was mis-identifying some of your species, some of the time. You want a way to note these data are suspect.

Solution: Most analysis software can’t see Excel or LibreOffice comments, and would be confused by comments placed within your data cells. As described above for formatting, create another field if you need to add notes to cells. Similarly, don’t include units in cells: ideally, all the measurements you place in one column should be in the same unit, but if for some reason they aren’t, create another field and specify the units the cell is in.

Entering more than one piece of information in a cell

Example: You find one male, and one female of the same species. You enter this as 1M, 1F.

Solution: Don’t include more than one piece of information in a cell. This will limit the ways in which you can analyze your data. If you need both these measurements, design your data sheet to include this information. For example, include one column for number of individuals and a separate column for sex.

Using problematic field names

Choose descriptive field names, but be careful not to include spaces, numbers, or special characters of any kind. Spaces can be misinterpreted by parsers that use whitespace as delimiters and some programs don’t like field names that are text strings that start with numbers.

Underscores (_) are a good alternative to spaces. Consider writing names in camel case (like this: ExampleFileName) to improve readability. Remember that abbreviations that make sense at the moment may not be so obvious in 6 months, but don’t overdo it with names that are excessively long. Including the units in the field names avoids confusion and enables others to readily interpret your fields.

Examples




Good Name Good Alternative Avoid
Max_temp_C MaxTemp Maximum Temp (°C)
Precipitation_mm Precipitation precmm
Mean_year_growth MeanYearGrowth Mean growth/year
sex sex M/F
weight weight w.
cell_type CellType Cell Type
Observation_01 first_observation 1st Obs

Using special characters in data

Example: You treat your spreadsheet program as a word processor when writing notes, for example copying data directly from Word or other applications.

Solution: This is a common strategy. For example, when writing longer text in a cell, people often include line breaks, em-dashes, etc in their spreadsheet. Also, when copying data in from applications such as Word, formatting and fancy non-standard characters (such as left- and right-aligned quotation marks) are included. When exporting this data into a coding/statistical environment or into a relational database, dangerous things may occur, such as lines being cut in half and encoding errors being thrown.

General best practice is to avoid adding characters such as newlines, tabs, and vertical tabs. In other words, treat a text cell as if it were a simple web form that can only contain text and spaces.

Inclusion of metadata in data table

Example: You add a legend at the top or bottom of your data table explaining column meaning, units, exceptions, etc.

Solution: Recording data about your data (“metadata”) is essential. You may be on intimate terms with your dataset while you are collecting and analysing it, but the chances that you will still remember that the variable “sglmemgp” means single member of group, for example, or the exact algorithm you used to transform a variable or create a derived one, after a few months, a year, or more are slim.

As well, there are many reasons other people may want to examine or use your data - to understand your findings, to verify your findings, to review your submitted publication, to replicate your results, to design a similar study, or even to archive your data for access and re-use by others. While digital data by definition are machine-readable, understanding their meaning is a job for human beings. The importance of documenting your data during the collection and analysis phase of your research cannot be overestimated, especially if your research is going to be part of the scholarly record.

However, metadata should not be contained in the data file itself. Unlike a table in a paper or a supplemental file, metadata (in the form of legends) should not be included in a data file since this information is not data, and including it can disrupt how computer programs interpret your data file. Rather, metadata should be stored as a separate file in the same directory as your data file, preferably in plain text format with a name that clearly associates it with your data file. Because metadata files are free text format, they also allow you to encode comments, units, information about how null values are encoded, etc. that are important to document but can disrupt the formatting of your data file.

Additionally, file or database level metadata describes how files that make up the dataset relate to each other; what format are they are in; and whether they supercede or are superceded by previous files. A folder-level readme.txt file is the classic way of accounting for all the files and folders in a project.

(Text on metadata adapted from the online course Research Data MANTRA by EDINA and Data Library, University of Edinburgh. MANTRA is licensed under a Creative Commons Attribution 4.0 International License.)

Key Points

  • Avoid using multiple tables within one spreadsheet.

  • Avoid spreading data across multiple tabs.

  • Record zeros as zeros.

  • Use an appropriate null value to record missing data.

  • Don’t use formatting to convey information or to make your spreadsheet look pretty.

  • Place comments in a separate column.

  • Record units in column headers.

  • Include only one piece of information in a cell.

  • Avoid spaces, numbers and special characters in column headers.

  • Avoid special characters in your data.

  • Record metadata in a separate plain text file.


Dates as data

Overview

Teaching: 7 min
Exercises: min
Questions
  • What are good approaches for handling dates in spreadsheets?

Objectives
  • Describe how dates are stored and formatted in spreadsheets.

  • Describe the advantages of alternative date formatting in spreadsheets.

  • Demonstrate best practices for entering dates in spreadsheets.

Dates are finicky

Dates in Excel

Date formats in excel need special attention. There are many numerous “useful features” which allow them to handle dates in a variety of ways.

There are different ways to visualize a date and as you work on your spreadsheet, the format may change unintentionally due to these features. Example:

Many formats, many ambiguities

Those visualizations of dates are valid for a given spreadsheet program (be it LibreOffice, Microsoft Excel, OpenOffice, Gnumeric, etc.) are usually guaranteed to be compatible only within the same family of products.

Additionally, Excel can turn things that aren’t dates into dates, for example names or identifiers like MAR1, DEC1, OCT4. So if you’re avoiding the date format overall, it’s easier to identify these issues.

mistakes-dates

excel-meme

What you see is not what everyone gets…

Behind the scenes excel stores a date as numbers, which allows you to make calculations with dates as well as making plots

Essentially, it counts the days from a default of December 31, 1899, and thus stores July 2, 2014 as the serial number 41822.

This serial number thing can actually be useful in some circumstances. By using the above functions we can easily add days, months or years to a given date. Say you had a sampling plan where you needed to sample every thirty seven days. In another cell, you could type:

July 2, 2014 = 41822

=B2+37

And it would return

8-Aug

because it understands the date as a number 41822, and 41822 + 37 = 41859 which Excel interprets as August 8, 2014.

Take home message working with excel dates:

All excel cells have a value in them that have a format on top of them which is what you see. This has implications when you want to share your data.

Best Practices in Date Formatting?

There are several solutions to get around this, which depends on the way that you are using dates. As long as it is described and consistent in your date column.

​ YYYY: the full year, i.e. 2015
​ MM: the month, i.e. 03
​ DD: the day of month, i.e. 24
​ hh: hour of day, i.e. 17
​ mm: minutes, i.e. 25
​ ss: seconds, i.e. 35

Such strings will be correctly sorted in ascending or descending order, and by knowing the format they can then be correctly processed by the receiving software.

Balancing the needs of analysis, preservation and data reuse.

If you need to use other date formats during your analysis, make sure to use a best-practice format when sharing your data and exporting final data for preservation (more on exporting later).

ISO 8601 Standard

ISO 8601 is an international standard covering the communication of date and time related data. (Source wikipedia)

iso

Standards in date and time notation are very important. For example the todays date June 15, 2022 is written in America as 06/15/22, while in Canada it is written as 15/06/2022.

Whatever format you decide to use, document it

Documenting the format you use is very important. If you share your data in Excel or another program, the person recieving your data can verify what they see is what you intended if you write down the format.

Example: Ambiguous dates

You have a date in your dataset 11/4/12.

  • In America you may assume this is November 4th, 2012.
  • In Europe you may assume this is April 11th, 2012.
  • In Japan you may assume this is April 12th, 2011

To be FAIR you have to think globally.

dates by country

  • image by Artem Karimov,TopoChecker (2019)

Don’t forget about time zones!

If you record times, document the time zone it is in. Is it your local time zone? What is that? Is it in Universal Time Coordinated (UTC)? More about UTC and GMT

  • UTC time zone is the standard used for instruments on oceanographic cruises. Imagine traveling back and forth across different time zones during your research. You don’t want your time series to jump to different local times! That makes using the data so difficult and sometimes impossible.

Key Points

  • Treating dates as multiple pieces of data rather than one makes them easier to handle.


Exporting data

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How can we export data from spreadsheets in a way that is useful for downstream applications?

Objectives
  • Store spreadsheet data in universal file formats.

  • Export data from a spreadsheet to a CSV file.

Exporting data files

Storing the data you’re going to work with for your analyses in Excel default file format (*.xls or *.xlsx - depending on the Excel version) isn’t a good idea. Why?

As an example of inconsistencies in data storage, do you remember how we talked about how Excel stores dates earlier? It turns out that there are multiple defaults for dates for different versions of the Excel software, and you can switch between them all. So, say you’re compiling Excel-stored data from multiple sources. There are dates in each file - Excel interprets them as their own internally consistent serial numbers. When you combine the data, Excel will take the serial number from the place you’re importing it from, and interpret it using the rule set for the version of Excel you’re using. Essentially, you could be adding errors to your data, and it wouldn’t necessarily be flagged by any data cleaning methods if your ranges overlap.

The CSV format

Storing data in a universal, open, and static format will help deal with this problem. Try tab-delimited (tab separated values or TSV) or comma-delimited (comma separated values or CSV). CSV files are plain text files (free of computer tags, special formatting, and code) where the columns are separated by commas, hence ‘comma separated values’ or CSV.

The advantage of a CSV file over an Excel/SPSS/etc. file is that we can open and read a CSV file using just about any software, including plain text editors like TextEdit or NotePad. Data in a CSV file can also be easily imported into other formats and environments, such as SQLite, R, Python and QGIS. We’re not tied to a certain version of a certain expensive program when we work with CSV files, so it’s a good format to work with for maximum portability and endurance.

example-csv

Most spreadsheet programs can save to delimited text formats like CSV easily, although they may give you a warning during the file export.

To save a file you have opened in Excel in CSV format:

  1. From the top menu select ‘File’ and ‘Save as’.
  2. In the ‘Format’ field, from the list, select ‘Comma Separated Values’ (*.csv).
  3. Double check the file name and the location where you want to save it and hit ‘Save’.

An important note for backwards compatibility: you can open CSV files in Excel!

Saving an Excel file to CSV

Dealing with comma’s


When talking about exporting data we discussed how to export Excel file formats into csv. Comma Separated Value files are indeed very useful and allow for easily exchanging and sharing data.

However, there are some significant problems with this particular format. Quite often the data values themselves may include commas (,). In that case, the software which you use (including Excel) will most likely incorrectly display the data in columns. This is because the commas which are a part of the data values will be interpreted as delimiters.

For example, our data might look like this:

	species_id,genus,species,taxa
	AB,Amphispiza,bilineata,Bird
	AH,Ammospermophilus,harrisi,Rodent-not,censused
	AS,Ammodramus,savannarum,Bird
	BA,Baiomys,taylori,Rodent

In the record AH,Ammospermophilus,harrisi,Rodent-not,censused the value for taxa includes a comma (Rodent-not,censused). If we try to read the above into Excel (or other spreadsheet program), we will get something like this:

Issue with importing csv format

The value for taxa was split into two columns (instead of being put in one column D). This can propagate to a number of further errors. For example, the extra column will be interpreted as a column with many missing values (and without a proper header). In addition to that, the value in column D for the record in row 3 (so the one where the value for ‘taxa’ contained the comma) is now incorrect.

If you want to store your data in csv format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values in quotes (“”). Applying this rule, our data might look like this:

species_id,genus,species,taxa
"AB","Amphispiza","bilineata","Bird"
"AH","Ammospermophilus","harrisi","Rodent-not, censused"
"AS","Ammodramus","savannarum","Bird"
"BA","Baiomys","taylori","Rodent"

Now opening this file as a csv in Excel will not lead to an extra column, because Excel will only use commas that fall outside of quotation marks as delimiting characters. However, if you are working with an already existing dataset in which the data values are not included in “” but which have commas as both delimiters and parts of data values, you are potentially facing a major problem with data cleaning.

If the dataset you’re dealing with contains hundreds or thousands of records, cleaning them up manually (by either removing commas from the data values or putting the values into quotes - “”) is not only going to take hours and hours but may potentially end up with you accidentally introducing many errors.

Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts.

Key Points

  • Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.

  • Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.