Research Data Management

Research Data Management (RDM) is important, but the name is something of a misnomer and the subject is poorly explained in most places.

Its origins are in two separate concerns:

Experimental data gathered at great public expense should be made openly available, after the original creators have had a head start in exploiting it for publications.

It should be much easier for researchers to compare their work quantitatively with that of others, to check whether the results printed in papers can indeed be obtained by the stated mathematical method, and to see whether the results of new research are better, for example in terms of accuracy.

In Maths, it is the second of these two concerns which is the more important one for almost everyone.

As a result of these concerns, EPSRC and other funding bodies have brought in rules (effective May 1, 2015 for EPSRC-funded research) requiring authors of journal articles and conference papers to make their "underlying data" openly available, after a potential embargo period.

For typical Maths papers, this means the following: for each plot, unless it is a very simple function specified in the paper, the authors must provide either a file with the relevant tabulated data, or the software which generated the data and plot. The collected data forms a dataset which must be openly available and archived for at least 10 years using a permanent DOI label, and this DOI must be provided in the paper, typically in the Acknowledgements section after acknowledging EPSRC funding.

A few Maths researchers, including perhaps those working on Big Data, may write papers based on public datasets (e.g. Twitter) or private datasets (e.g. stock market datafeeds). These researchers can contact the open access coordinator (Patrick Farrell) about the rules.

If you would like to read more about RDM, here are some links:

Oxford University Research Data website

Oxford University Policy on the management of research data and records

EPSRC's own guidance on its Expectations

Note that the University RDM policy effectively extends the rules to all research, not just that funded by EPSRC. For example, it has been suggested that DPhil students will be required to deposit their "data" along with their dissertation as a requirement for graduation.

Dataset format

EPSRC does not specify the format in which you should store the data, so you should apply common sense.

Remember that the data is to be archived for a 10-year period; over that timescale, any proprietary format (such as an Excel spreadsheet, or MATLAB code) may no longer be usable, although it may be the most helpful format in the short term. Also, not everyone will have access to the required proprietary software.

Consequently, the advice is to also include the simplest possible CSV (comma separated values) text file with the tabulated data.
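As an illustration of how simple such a file can be, here is a minimal sketch in Python (chosen only because it is freely available; any language would do). The function name write_figure_csv, the file name figure2.csv and the sample numbers are all made up for this example and are not taken from any real paper.

```python
import csv


def write_figure_csv(path, header, rows):
    """Write tabulated figure data to a plain CSV text file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # First line names the columns so the file is self-describing
        writer.writerow(header)
        for row in rows:
            # repr() of a float round-trips exactly, so no precision is lost
            writer.writerow([repr(float(v)) for v in row])


# Hypothetical data for one figure: error against grid size n
rows = [(n, 1.0 / n**2) for n in (10, 20, 40, 80)]
write_figure_csv("figure2.csv", ["n", "error"], rows)
```

One file per figure, with a header line naming the columns, is usually enough for a reader to reproduce the plot with whatever tools they have.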

You can upload individual files, but it may be simpler to create a Zip file holding everything, including a README text file which identifies the contents, and then there's just one file to upload.
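A Zip file of this kind can be assembled by hand, but for those who prefer to script it, here is a rough sketch using Python's standard zipfile module. The function name build_dataset_zip, the file names and the README wording are assumptions made for this example only.

```python
import zipfile
from pathlib import Path


def build_dataset_zip(zip_path, files, readme_text):
    """Bundle data files plus a README.txt into a single Zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The README identifies what each file in the archive contains
        zf.writestr("README.txt", readme_text)
        for f in files:
            f = Path(f)
            # Store under the bare file name so the archive has a flat layout
            zf.write(f, arcname=f.name)
    return zip_path


# Hypothetical contents: one small CSV file created here so the example runs
Path("figure2.csv").write_text("n,error\n10,0.01\n", encoding="utf-8")
readme = "figure2.csv: tabulated data for figure 2 (columns: n, error)\n"
build_dataset_zip("dataset.zip", ["figure2.csv"], readme)
```

The resulting dataset.zip is then the single file to upload.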

Depositing the dataset in Symplectic

Once the dataset has been created, it has to be uploaded to the Oxford Research Archive (ORA), where it will be held by the university for at least 10 years and given a permanent DOI label. Datasets used to be uploaded directly to ORA, but as of 23 May 2023 the only way to upload to ORA is via Symplectic. To do so, please follow the instructions on the Bodleian's Open Access webpage.

When to deposit

This is a slightly more difficult issue than one might expect.

The problem is that the paper has to reference the DOI of the dataset, but once a dataset has been given a DOI you cannot change the dataset; you can only update some aspects of its metadata, such as the DOI of the paper it is associated with.

The advice therefore is as follows:

put a sentence like "In compliance with EPSRC's open access initiative, the data in this paper is available from http://dx.doi.org/xxx/xxx." in the Acknowledgements in the draft paper as it goes through the journal reviewing process

when the paper is accepted, generate and deposit the dataset, and edit the paper text to give the correct dataset DOI; this version is the AAM (author's accepted manuscript), which you will also need to deposit in ORA to satisfy HEFCE's open access requirements. With most journals this is the point at which you submit the source files for the paper, so making the change should be straightforward, but with some journals you may need to make it at the publication proof stage

when the paper finally appears in print or online, update the dataset metadata to give the full details of the paper, including its DOI.
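The second step above, swapping the placeholder link for the real one, is easy to get wrong by hand, so a small script can help. The following is only a sketch: the function name insert_dataset_doi is hypothetical, and the DOI "10.1234/example" is made up purely for illustration.

```python
PLACEHOLDER = "http://dx.doi.org/xxx/xxx"


def insert_dataset_doi(paper_text, doi):
    """Replace the placeholder link in the draft with the real dataset DOI link."""
    if PLACEHOLDER not in paper_text:
        # Fail loudly rather than silently leaving the paper without a DOI
        raise ValueError("placeholder DOI link not found in paper text")
    return paper_text.replace(PLACEHOLDER, "http://dx.doi.org/" + doi)


draft = ("In compliance with EPSRC's open access initiative, the data in this "
         "paper is available from http://dx.doi.org/xxx/xxx.")
# "10.1234/example" is a made-up DOI, not a real dataset
final = insert_dataset_doi(draft, "10.1234/example")
```

Raising an error when the placeholder is missing guards against submitting source files that still lack the data access statement.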

An example

The example is a conference paper which can be downloaded from here.

The "data" in this paper consists of its figures; figure 1 is purely illustrative and has no real content, and the data in the remaining figures is generated and plotted by software written in MATLAB.

It would be sufficient for EPSRC purposes to do either of the following:

provide text files which tabulate the data in each of the figures
provide the software which generates and plots each of the figures

In this case, both of these were done, and a zip file dataset was created which contains the figures, the tabulated data and the MATLAB codes, as well as a simple README text file which explains what is what. The dataset was then deposited and can be viewed by following this link.

In the paper, note the data access statement in the Acknowledgements. There is some flexibility on the precise wording, but the key is that it should specify the DOI for the dataset.