Research Data Management

Research Data Management (RDM) is important, but the name is something of a misnomer and the topic is poorly explained in most places. 

Its origins are in two separate concerns:

  • Experimental data gathered at great public expense should be made openly available, after the original creators have had a head start in exploiting it for publications.

  • It should be much easier for researchers to compare their work quantitatively with that of others, to check whether the results printed in papers can indeed be obtained by the stated mathematical method, and to see whether the results of new research are better, for example in terms of accuracy.

In Maths, it is the second of these two concerns which is the more important one for almost everyone. 

As a result of these concerns, EPSRC and other funding bodies have brought in new rules (effective May 1, 2015 for EPSRC-funded research) requiring authors of journal articles and conference papers to make their "underlying data" openly available, after a potential embargo period. 

For typical Maths papers, this means the following: for each plot, unless it is a very simple function specified in the paper, the authors must provide either a file with the relevant tabulated data, or the software which generated the data and plot. The collected data forms a dataset which must be openly available and archived for at least 10 years using a permanent DOI label, and this DOI must be provided in the paper, typically in the Acknowledgements section after acknowledging EPSRC funding. 

A few Maths researchers, including perhaps those working on Big Data, may write papers based on public datasets (e.g. Twitter) or private datasets (e.g. stock market datafeeds). These researchers can talk to Mike Giles about the new rules. 

If you would like to read more about RDM, here are some links:

Note that the University RDM policy is effectively extending the rules to all research, not just that funded by EPSRC. For example, it has been suggested that DPhils will be required to deposit their "data" along with their dissertation as a requirement for graduation. 
 


Dataset format

EPSRC does not specify the format in which you should store the data, so you should apply common sense. 

Remember that the data is to be archived for a 10-year period. Over that timescale, a proprietary format (such as an Excel spreadsheet or MATLAB code) may no longer be usable, even though it may be the most helpful format in the short term. Also, not everyone will have access to the required proprietary software. 

Consequently, the advice is to also include the simplest possible CSV (comma separated values) text file with the tabulated data. 
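As an illustration of how simple such a file can be, here is a minimal Python sketch that writes the points behind one plot to a CSV file using only the standard library. The function name write_figure_csv, the file name figure2.csv, the column names and the sample data are all hypothetical; substitute whatever your own figure plots.

```python
import csv
import math
import tempfile
from pathlib import Path


def write_figure_csv(path, header, rows):
    """Write one figure's tabulated data as a plain CSV text file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)   # column names on the first line
        writer.writerows(rows)    # one line per plotted point


# Hypothetical data: error of a forward-difference approximation to the
# derivative of sin at x = 1, for a range of step sizes h.
steps = [10.0 ** -k for k in range(1, 6)]
rows = [(h, abs((math.sin(1 + h) - math.sin(1)) / h - math.cos(1))) for h in steps]

# A temporary directory is used here only to keep the sketch self-contained.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "figure2.csv"
write_figure_csv(csv_path, ["h", "abs_error"], rows)
```

A header line naming each column, with one row per plotted point, is usually enough for someone to reproduce the plot in any software.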

You can upload individual files into the ORA-data depository, but it may be simpler to create a Zip file holding everything, including a README text file which identifies the contents, and then there's just one file to upload. 
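One possible way to build such a Zip file is sketched below in Python, using the standard zipfile module. The function name build_dataset_zip, the file names and the README wording are assumptions made for illustration; the sketch also assumes the source directory does not already contain a README.txt and that the Zip file is written outside that directory.

```python
import tempfile
import zipfile
from pathlib import Path


def build_dataset_zip(source_dir, zip_path, readme_lines):
    """Bundle every file under source_dir, plus a README.txt, into one Zip file."""
    source_dir = Path(source_dir)
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The README is written first so it heads the archive listing.
        zf.writestr("README.txt", "\n".join(readme_lines) + "\n")
        for p in files:
            # Store paths relative to source_dir, with forward slashes.
            zf.write(p, arcname=p.relative_to(source_dir).as_posix())
    return zip_path


# Hypothetical dataset contents, created in a temporary directory.
work = Path(tempfile.mkdtemp())
data_dir = work / "dataset"
data_dir.mkdir()
(data_dir / "figure2.csv").write_text("h,abs_error\n0.1,0.04\n", encoding="utf-8")
(data_dir / "figure2.m").write_text("% MATLAB script for figure 2\n", encoding="utf-8")

zip_path = build_dataset_zip(
    data_dir,
    work / "dataset.zip",
    [
        "Data for paper 'xxxxxxxx'",
        "figure2.csv  tabulated data plotted in figure 2 (columns: h, abs_error)",
        "figure2.m    MATLAB script which generates and plots figure 2",
    ],
)
```

Listing each file with a one-line description in the README is what lets a reader make sense of the archive years later.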
 


Depositing the dataset in ORA-data

Once you have created a dataset, you must upload it to ORA-data, part of the Oxford Research Archive, where it will be held by the university for at least 10 years and given a permanent DOI label. 

To do this you must:

  • go to http://ora.ox.ac.uk/information/contribute and click on Data (or simply click on this second link)

  • after going through the SSO authentication (if necessary) fill in the information which is required, and upload the dataset

  • for "Title", you might like to use "Data for paper 'xxxxxxxx' "

  • you can ignore all optional fields (not marked with a red *)

  • for "documentation about your dataset ...", it's simplest to put "see README file in dataset"

  • students/postdocs should also add the PI of the research project as a "creator" since this person is likely to remain longest in the university and therefore can answer any future questions about the dataset

  • for "Related publications", this is where you will be able to give information about the article in which this data will appear; if you don't know some of the information now it is not a problem because you will be able to update this later

  • for "Archive service payment", tick "Payment is not required" -- the service is currently free for a trial period but eventually the dept will need to pay somehow

  • for "Data Steward", specify Waldemar Schlackow (it should auto-complete if you type in Schlackow) and specify his role as "Information/Data Manager".

  • for "Access conditions", it seems there is no choice except to make it immediately available.

I think these are the key elements which people may have questions about, but if there are others please let me know and I will update this guidance. 
 


When to deposit

This is a slightly more difficult issue than one might expect. 

The problem is that the paper has to reference the DOI of the dataset, but once a dataset has been given a DOI you cannot change the dataset; you can only update some aspects of its metadata, such as the DOI of the paper it is associated with. 

The advice therefore is as follows:

  • put a sentence like "In compliance with EPSRC's open access initiative, the data in this paper is available from http://dx.doi.org/xxx/xxx." in the Acknowledgements in the draft paper as it goes through the journal reviewing process

  • when the paper is accepted, generate and deposit the dataset, and edit the paper text to give the correct dataset DOI. This version is the AAM (author's accepted manuscript), which you will also need to deposit in ORA to satisfy HEFCE's open access requirements. With most journals this is the point at which you submit the source files for the paper, so making the change should be OK; with some journals you may need to make it at the publication proof stage

  • when the paper finally appears in print or online, update the dataset metadata to give the full details of the paper, including its DOI; it's possible that in the future the ORA-data system will be able to do this automatically
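
If you keep your manuscript as a text file (LaTeX, for example), the placeholder swap in the steps above could be scripted along the following lines. This is only a sketch: the function names and the DOI 10.0000/example-dataset are hypothetical, and it assumes the statement sits on a single line in the source exactly as generated.

```python
PLACEHOLDER_DOI = "xxx/xxx"  # stand-in used while the paper is under review


def data_access_statement(doi=PLACEHOLDER_DOI):
    """Return the Acknowledgements sentence which gives the dataset DOI."""
    return (
        "In compliance with EPSRC's open access initiative, the data in this "
        f"paper is available from http://dx.doi.org/{doi}."
    )


def insert_final_doi(manuscript_text, final_doi):
    """Replace the placeholder statement once the dataset has its DOI."""
    old = data_access_statement()
    if old not in manuscript_text:
        # Fail loudly rather than submit a manuscript with the placeholder.
        raise ValueError("placeholder data access statement not found")
    return manuscript_text.replace(old, data_access_statement(final_doi))


# Draft text as it goes through review, then the accepted version.
draft = "This work was funded by EPSRC. " + data_access_statement()
final = insert_final_doi(draft, "10.0000/example-dataset")  # hypothetical DOI
```

Raising an error when the placeholder is missing guards against sending off a manuscript which still says xxx/xxx.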

An example

The example is a conference paper which can be downloaded from here

The "data" in this paper consists of its figures; figure 1 is purely illustrative and has no real content, and the data in the remaining figures is generated and plotted by software written in MATLAB. 

It would be sufficient for EPSRC purposes to do either of the following:

  • provide text files which tabulate the data in each of the figures
  • provide the software which generates and plots each of the figures

In this case, both of these were done, and a zip file dataset was created which contains the figures, the tabulated data and the MATLAB codes, as well as a simple README text file which identifies each item. The dataset was then deposited and can be viewed by following this link.

In the paper, note the data access statement in the Acknowledgements. There is some flexibility on the precise wording, but the key is that it should specify the DOI for the dataset.