Oxford DMPonline Progress Report, December 2012

At this stage in the Oxford DMPonline Project, the principal developments required to achieve the envisaged Oxford DMPonline service have been undertaken.  Over the next several weeks they will be integrated into a coherent whole, customized for Oxford University researchers to use, and tested.

A detailed progress report on the project is available at http://imageweb.zoo.ox.ac.uk/pub/2013/reports/Oxford_DMPonline_Progress_Report_Dec2012.docx.


DMP questions – comparisons and conclusions

This blog post compares sets of data management planning questions, following their alignment in a spreadsheet whose creation is described in the previous blog post, and draws general conclusions about such question sets.

Characteristics of each set of questions

The following data management planning question sets are aligned in the comparison spreadsheet, which is downloadable from here.

DCC’s DMPonline questions (Checklist version 3.0, dated 2 Feb 2012):

The DCC uses a large set of questions for its DMPonline system, 94 questions in all, which have been carefully ordered in a number of distinct sections. Great care has been taken to eliminate duplication of questions soliciting information about the same thing, but there are still a few compound questions that demand multiple answers. If the DMP being created is to accompany a grant application, the questions asked of the user are tailored to the funding agency’s requirements. Once a project has been funded, the user is given access to the full set of questions to enable more detailed planning.

Areas that are covered in detail by the DCC’s DMPonline questions, but that have scant representation in the other question sets, include:

    • Resourcing
    • Plan adherence
    • Policy
    • Funders and budget
    • Use of existing data
    • Ethical issues
    • Likely reuse
    • Security

Many of the DCC’s questions ask why people are doing certain things, requiring discursive answers.

David Shotton’s Twenty Questions for Research Data Management:

This question set is designed to be concise, and deliberately omits metadata questions (e.g. “Who is the author of this DMP?”). Most of the questions map directly onto the DCC’s DMP questions, but there are three unique questions relating to methods and to who decides about data triage. To guide the user, each question is accompanied by exemplar short possible responses.

DMP Tool (USA):

Rather than give a prescriptive list of questions, the DMP Tool developed in the USA has a less structured approach. It presents the user with a small number of sections to complete, each of which has a text box in which the user can enter free text on that particular topic, with guidance text and suggested questions to consider provided in the accompanying Help Box. Most of the questions have been cherry-picked from the DCC’s question set, as part of a collaboration between the two development projects, with a few new ones added. The exact number of sections and suggested questions varies according to which funding agency has been selected.

The following text, taken from the National Science Foundation Generic (NSF-GEN) option, is an example of the wording provided in a Help Box, this one for Section 1: Types of data produced:

Give a short description of the data, including amount (if known) and content. If the project will be collecting data of a sensitive nature, note here and reflect upon it in subsequent sections. Data types could include text, spreadsheets, images, 3D models, software, audio files, video files, reports, surveys, patient records, etc. Consider these questions:

      • What data will be generated in the research?
      • What data types will you be creating or capturing?
      • How will you capture or create the data?
      • If you will be using existing data, state that fact and include where you got it.
      • What is the relationship between the data you are collecting and the existing data?

For simplicity, it was only the specific bulleted questions from the DMP Tool that I entered into my comparison spreadsheet. As a consequence, the full extent of the DMP Tool’s requested information is under-represented. For example, the DMP Tool column of the spreadsheet does not contain the specific question “What amount of data will be generated”, although that question is implied in the statement given above: “Give a short description of the data, including amount (if known) and content.”

DataTrain Questions for Post-Graduate Research Projects:

This question set is designed specifically for use by graduate students commencing their research projects. As such, it contains no questions about funders or resourcing. It is the smallest of the sets being compared, containing just six metadata questions and six data questions (one of which is actually about the student’s thesis). It contains no questions about short-term data storage and backup, nor about data publication, and as such is insufficient for creating an effective DMP.

Jez Cope’s Questions for Post-Graduate DMPs (University of Bath):

This question set is also designed specifically for use by graduate students commencing their research projects, but is more comprehensive than the DataTrain set, containing seven basic metadata questions and (coincidentally) twenty questions about the data. Uniquely, it contains five questions that no-one else had thought to ask, about actions to be undertaken as a result of creating the DMP, about the frequency of data acquisition and the volume of data, and about versioning of data files.

Comparison of the question sets

After aligning the different sets of data management planning questions in the spreadsheet as described in the previous blog post, I was able to compare them more easily.

First, I looked down each set of questions and marked

      • in green those questions that were compound, demanding two or more answers;
      • in plum those duplicate questions that asked for the same or similar information in different ways; and
      • in blue questions that were unique to each of the smaller sets, using the DCC questions as the primary comparator.

The numbers of such questions in each set are shown in the following table:

Source             Plan questions   Project questions   Data questions   Compound questions   Duplicate questions   Unique questions
DCC DMPonline      17               14                  63               3                    0                     (Reference)
Shotton (Oxford)   0                0                   20               0                    0                     3
DMP Tool (NSF)     12               0                   30               5                    5                     3
DMP Tool (NIH)     9                0                   8                1                    2                     2
DataTrain          4                2                   6                1                    0                     1
Jez Cope (Bath)    5                2                   20               1                    2                     5

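
The counts in the table above can be cross-checked with a few lines of Python. This is only a sketch: the dictionary layout and variable names are assumptions made for illustration, and the figures are simply copied from the table.

```python
# Question counts copied from the table above, per question set:
# (plan, project, data, compound, duplicate).
counts = {
    "DCC DMPonline": (17, 14, 63, 3, 0),
    "Shotton (Oxford)": (0, 0, 20, 0, 0),
    "DMP Tool (NSF)": (12, 0, 30, 5, 5),
    "DMP Tool (NIH)": (9, 0, 8, 1, 2),
    "DataTrain": (4, 2, 6, 1, 0),
    "Jez Cope (Bath)": (5, 2, 20, 1, 2),
}

# Total questions per set = plan + project + data questions.
totals = {name: sum(row[:3]) for name, row in counts.items()}

for name, total in totals.items():
    print(f"{name}: {total} questions")
```

The DCC total comes to 94, matching the figure quoted in the description of that question set, and the Jez Cope total of 27 matches his seven metadata questions plus twenty data questions.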

Compound and duplicate questions

One thing that quickly became clear while I was creating my own Twenty Questions for Research Data Management was how easy it was inadvertently to ask compound questions requiring two or more answers, or to ask for the same information twice in slightly different ways. By revising my own questions, splitting compound questions into pairs of single questions, and rewording and combining others, I managed to eliminate these problems from the published Twenty Questions, but similar problems are present in the other question sets.

Examples of compound questions:

      • DCC question 6.3.2: How will this metadata/documentation be created, and by whom?
      • DMP Tool: Who will hold the intellectual property rights to the data and how might this affect data access?
      • Jez Cope: What should/shouldn’t be shared and why?
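
Compound questions of this kind can often be spotted mechanically. The following is a crude heuristic sketch, not the method used for the comparison (which was done by eye): it assumes that a question is probably compound when the word "and" is followed by a second question word. The function name `looks_compound` and the word list are hypothetical.

```python
import re

# Assumption: "and" followed (optionally via "by") by a question word
# signals a second question hiding inside the first.
COMPOUND = re.compile(
    r"\band\s+(?:by\s+)?(?:who|whom|what|when|where|why|how|which)\b",
    re.IGNORECASE,
)

def looks_compound(question: str) -> bool:
    """Return True if the question appears to ask for two or more answers."""
    return COMPOUND.search(question) is not None

examples = [
    "How will this metadata/documentation be created, and by whom?",
    "Who will hold the intellectual property rights to the data "
    "and how might this affect data access?",
    "What should/shouldn’t be shared and why?",
    "What data will be generated in the research?",
]
flags = [looks_compound(q) for q in examples]
for question, flag in zip(examples, flags):
    print("COMPOUND" if flag else "single  ", question)
```

The first three examples are flagged and the fourth is not. A heuristic like this will miss compound questions phrased in other ways, so it can only supplement a careful reading.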

Examples of duplicate questions in the DMP Tool (NSF):

      • What contextual details (metadata) are needed to make the data you capture or collect meaningful?
      • What metadata/documentation will be submitted to make the data reusable?
      • What related information will be deposited?

Examples of duplicate questions in Jez Cope’s set:

      • Who else should reasonably have access to your data?
      • Who should have access and under what conditions?

Institution-specific questions

There were two institution-specific questions that were impossible to understand outside the specific context in which they were asked.

Institution-specific questions in the DMP Tool set:

Solicitation number

Institution-specific questions in the DataTrain set:

3 Has a ‘File Structure/Naming Form’ been completed?

Unique questions

Surprisingly, given the number and variety of questions comprising the DCC’s question set, each of the other smaller question sets contained metadata questions or data questions that were unique. Jez Cope’s question set was remarkable in containing five unique questions – two unique metadata questions out of a total of seven, and three unique data questions out of a total of twenty – questions that no-one else had thought to include in their sets.

All the unique questions are listed below:

Unique data questions from Shotton’s Twenty Questions:

(note – question numbers changed to match updated 20 Questions)

5 When and where will you describe each of your research datasets?

Possible responses:

  • The only description will be the filenames on my hard drive.
  • I will describe the data using handwritten notes in my lab notebook if and when I have time, after the experiments have been completed – hopefully I’ll be able to remember all the details
  • I will describe the data using the column and row labels in my spreadsheets after the data have been analysed.
  • I will create descriptive metadata for each dataset as I create/acquire it, and will save these descriptions with my datasets on my hard drive.

18 Who will decide which of your research data are worth preserving?

Possible responses:

  • Myself alone.
  • Myself, in consultation with my research supervisor.
  • My research supervisor alone.

19 How (i.e. by what physical or electronic method) will you transfer your research datasets to their long-term archive, under the curatorial care of a separate third-party, e.g. a data repository?

Possible responses:

  • On physical hard drives that I will bring back from my field site by air.
  • By e-mailing files to our librarian.
  • By completion of the Web-based database submission form and uploading of the data files over the Internet.
  • By automated data packaging and repository submission over the Web from my local DataStage filestore, using the SWORD repository submission protocol.

Unique metadata questions from DMP Tool (NSF-BIO and NIH):

Plan name

Comment

Unique data questions from DMP Tool (NSF-BIO):

What data types will you be creating or capturing?

What procedures does your intended long-term data storage facility have in place for preservation and backup?

Unique metadata questions from DataTrain:

Version (number)

Date amended

Unique metadata questions from Jez Cope:

What actions have you identified from the rest of this plan?

What further information do you need to carry out these actions?

Unique data questions from Jez Cope:

How often do you get new data?

How much data do you generate?

What different versions of each data file do you create?

Because of its size, the DCC’s question set contained many questions that are not present in the other sets, or that are on topics represented by fewer questions. (But see remarks above about the DMP Tool questions.) Of these unique DCC questions, the most significant are listed below. Note how some of these relate to policy issues or administrative issues about which a typical research scientist is likely to have little knowledge, which would thus be difficult and time-consuming to answer satisfactorily:

Unique metadata questions from DCC’s DMPonline:

1.4.2 Aims and purpose of this plan

1.4.3 Target audience for this plan

1.3.1 Funding body requirements relating to the creation of a data management plan

1.3.2 Institutional or research group guidelines

1.3.3 Other policy-related dependencies

7.2 How will data management activities be funded during the project’s lifetime?

7.3 How will longer-term data management activities be funded after the project ends?

8.2.2 Who will carry out reviews?

Unique data questions from DCC’s DMPonline:

2.2.1 Have you reviewed existing data, in your own institution and from third parties, to confirm that new data creation is necessary?

2.2.3 Describe any access issues pertaining to the pertinent existing data

2.3.1 Why do you need to capture/create new data?

2.3.4 What criteria will you use for Quality Assurance/Management?

2.4.2 How will you manage integration between the data being gathered in the project and pre-existing data sources?

2.4.3 What added value will the new data provide to existing datasets?

3.1.3 Is the data that you will be capturing/creating “personal data” in terms of the Data Protection Act (1998) or equivalent legislation if outside the UK?

3.1.4 What action will you take to comply with your obligation under the Data Protection Act (1998) or equivalent legislation if outside the UK?

3.2.1 Will the dataset(s) be covered by copyright or the Database Right?

3.2.4 For multi-partner projects, what is the dispute resolution process / mechanism for mediation?

4.2.1 Does the original data collector/ creator/ principal investigator retain the right to use the data before opening it up to wider use?

4.3.1 Which groups or organisations are likely to be interested in the data that you will create/capture?

4.3.2 How do you anticipate your new data being reused?

5.3.1 How will you manage access restrictions and data security during the project’s lifetime?

5.3.2 How will you implement permissions, restrictions and/or embargoes?

5.3.3 Give details of any other security issues.

6.2.1 Will or should data be kept beyond the life of the project?

6.2.5 On what basis will data be selected for long-term preservation?

6.2.7 Will transformations be necessary to prepare data for preservation and/or data sharing?

6.3.5 How will you address the issue of persistent citation?

6.3.3 Will you include links to published materials and/or outcomes?

Missing data management planning questions

Some important questions were not asked by anyone!

Missing metadata questions:

Who is the Principal Investigator of the research project to which this DMP relates?

What is the Principal Investigator’s department?

Does the research described in this plan require approval by an ethics committee?

Is this DMP confidential, or will it be made public after a positive funding decision on the grant application for the project to which it relates?

What is the unique identifier for this DMP?

Missing data questions:

Do/will the datasets described in this DMP contain experimental or observational data that it would be impossible to re-acquire or re-collect (e.g. sub-atomic particle disintegration data, seismic records, animal behaviour observations, astronomical or meteorological data, questionnaire responses)?

Are metadata created automatically for any of your data (e.g. time, date and geo-location information captured in the Exif header of an image file)?

If so, please define the data types and specify the nature of the accompanying metadata.

To what published journal article(s) do the data covered by this DMP relate?

Conclusions

      1. Creating an appropriate question set for DMPs is difficult work, since there are many possible questions one could ask about data management.
      2. Care needs to be taken to avoid asking ambiguous questions and questions that require more than one answer, and to avoid asking for the same or similar information multiple times in different questions.
      3. Despite the comprehensiveness of the DCC’s DMPonline question set, each of the other question sets has unique questions not covered by the DMPonline set.
      4. All of the available question sets have drawbacks, and some have unique strengths.
      5. In terms of comprehensiveness, the best may be the enemy of the good enough.
      6. Further work needs to be done by the community as a whole to build on the work undertaken so far, and to devise and standardize the best possible set of questions for different constituencies of user, e.g.
        1. Applicants for research grants, who need concise DMPs tailored to funders’ requirements, including the possibility of including standardized institutional answers to certain questions (for example about data backup facilities and the institutional data repository).
        2. Students and researchers starting research projects on existing funding, who have no need for questions about funders and budgets, but who need to be encouraged to think about the nitty-gritty of their own data management tasks.
      7. There will always be a need for individuals or institutions to be able to add their own specific questions to a question set, questions that are particular to their situation and could not have been anticipated by those devising the question set.

This last point relates to the functionality of the software tools that permit answers to data management planning questions to be collected and a formal data management plan to be created. I will compare these tools in a subsequent post.

Posted in Data management planning, JISC

DMP questions – description and alignment

This blog post describes various sets of data management planning questions from different sources and their alignment in a spreadsheet to permit their comparison. The following post will compare these different sets of data management planning questions, and will suggest conclusions that can be drawn from such a comparison. Subsequently, I will compare various software tools that might be used to create data management plans (DMPs) based on such questions.

These discussions should be considered in the context of the true purpose of data management planning: not to fulfill the requirements of research funders, but to better manage the research data flowing from research projects – which is the funders’ aim in the first place when requesting submission of data management plans to accompany grant applications.

Background

As an exercise in ‘drinking my own champagne’ when applying to the JISC for funding of the Oxford DMPonline Project in July last year, I used the Digital Curation Centre (DCC)’s DMPonline Tool to create a data management plan (DMP) to accompany my grant application. Here is the first paragraph of that eight-page plan, the full version of which is available here.

In response to this first exposure to the DCC’s questions, I had the following thoughts:

  1. That the whole questionnaire was longwinded, often requiring discursive answers, e.g.

    Question 2.4.2 “How will you manage integration between the data being gathered in the project and pre-existing data sources?”

    Question 2.5.5 “Why have you chosen particular standards and approaches for metadata and contextual documentation?”

  2. That its primary point of view was that of a data administrator, not a researcher, e.g.

    Question 1.3.3 “(Define) Other policy-related dependencies.”

    Question 6.4.2 “In the event of the long-term place of deposit closing, what is the formal process for transferring responsibility for the data?”

    Question 7.1 “Outline the staff/organisational roles and responsibilities for implementing this data management plan.”

    Question 8.1.1 “How will adherence to this data management plan be checked or demonstrated?”

  3. That researchers were unlikely to know the answers to several of the questions, and would thus have to invest considerable time and effort consulting experts in data management to discover them.
  4. That, even discounting this, it would take longer to complete the many detailed questions asked than most researchers would be prepared to give to the task.
  5. That the eight-page output that contained my own DMP was far longer than could normally be submitted to accompany a grant application – funders typically have a limit of one or two pages for DMPs.
  6. That, even if the DMP was downloaded in an editable format, rather than the default PDF format, substantial further work would thus be required to change the DMPonline Tool output into a submittable data management plan.

I concluded that the average researcher would not find the DCC’s DMPonline question set easy to use, and that uptake was likely to be limited (see Footnote).

Others’ experiences when using the DCC’s DMPonline Tool have been similar to my own. Jez Cope gave different groups of University of Bath graduate students one hour to complete the DCC’s DMPonline questionnaire, or one of three alternative question sets. He reports at http://blogs.bath.ac.uk/research360/2012/03/rdm101-data-management-planning/ on their reaction to using the DMPonline Tool:

  • “The students were immediately put off by the amount of detail they were asked to input”
  • “On a positive note, they definitely felt that this was the most comprehensive template!”
  • “None of the students using this template got anywhere near to finishing the questions within the time.”
  • “The students reported that very little of what they were being asked felt relevant to them.”
  • “They said that for at least some of the questions it was difficult to understand what they were being asked for.”

Twenty Questions for Research Data Management

Recently, many weeks after I had completed the DMP submitted with my Oxford DMPonline grant application, when I had forgotten the specific questions asked by the DCC’s DMPonline tool, but still retained an uneasy feeling that they were too detailed, I sat down with a blank sheet of paper to write down the most important questions I could think of concerning research data management, from the researcher’s point of view. I reckoned that most people could manage twenty questions without giving up, particularly if I also provided some short example responses.

In my first draft, the resulting Twenty Questions for Research Data Management were arranged under the six headings What? Where? How? When? Who? and Why? Jez Cope kindly gave that question set a test drive with another group of his University of Bath graduate students, at the same time as the first group were answering the DCC’s question set. In the same blog post, he reports these students’ reactions to Twenty Questions for Research Data Management:

  • “These questions were considered to be mostly relevant and easy to understand, and the students had no problem completing them in the time available. The example responses made it easier to understand what was required for each question.”

Jez gave me valuable feedback about the poor ordering of that original draft of the Twenty Questions, due to the constraints imposed by the headings I had chosen:

  • “Because they were arranged under What, Where, etc., the students found it difficult at times to see how the questions related to each other. Perhaps because of this, the students were undecided as to whether it (the question set) was comprehensive enough.”

Since then, I have put the Twenty Questions through two revisions. The first re-ordered them into a more logical progression, under the new headings:

  • The nature of your data
  • Data descriptions (metadata, “data about data”)
  • Data sharing
  • Data storage and backup
  • Data archiving
  • Data publication
  • Future data management

The second revision simplified the wording, removed some redundancy between questions, and split compound questions into single questions. To keep the total number of questions to twenty, I removed two questions about when data would be collected and analyzed, which I considered of lesser importance. The resulting Twenty Questions for Research Data Management, with examples of possible responses for each question, were published in an earlier blog post, and are available as a downloadable Word file here.

Different sets of data management planning questions

Having completed my own Twenty Questions for Research Data Management, I thought it would be useful to compare them side-by-side with the other English-language question sets of which I had knowledge:

  • The DCC DMPOnline Checklist dated 2 Feb 2012, kindly sent to me by Martin Donnelly in the form of an Excel spreadsheet. (A PDF version of this is available here.)
  • Two versions of the questions used in the DMP Tool (developed in the USA to create DMPs tailored for US funding agencies) – I chose to use those questions suggested for NSF-BIO and for the NIH, which were fewer in number.
  • The DataTrain Questions for Post-Graduate Research Projects, available from the Archaeological Data Service here.
  • Jez Cope’s own draft set of University of Bath Questions for Post-Graduate DMPs, available here.

The largest corpus of questions, from the DCC, are individually numbered, and structured under the following headings:

  • Metadata about the DMP itself (unnumbered questions)
  • Section 1: Introduction and Context
  • Section 2: Data Types, Formats, Standards and Capture Methods
  • Section 3: Ethics and Intellectual Property
  • Section 4: Access, Data Sharing and Reuse
  • Section 5: Short-Term Storage and Data Management
  • Section 6: Deposit and Long-Term Preservation
  • Section 7: Resourcing
  • Section 8: Adherence and Review

Since they don’t actually ask questions, I omit from the following discussion the final sections of the DCC corpus:

  • Section 9: Statement of Agreement
  • Section 10: Annexes

Re-ordering the DCC’s DMPonline questions

To enable subsequent comparison between different question sets, I first re-ordered the DCC’s questions so as to provide a clear separation between three different types of question:

  • First those seeking metadata about the data management plan itself.
  • Then those eliciting information about the research project to which the plan relates.
  • Finally those questions that are concerned with managing the data per se.

To these, since it was required by one of the other question sets, I added a fourth category into which I moved two of the DCC’s questions:

  • Questions about related documents.

With expanded headings, the new order of the DCC’s questions used in my alignment spreadsheet is as follows:

Questions about the data management plan

  • Personal details of the plan creator (DCC questions lacking a number)
  • About this data management plan (DCC Section 1.4)
  • Data management resourcing (DCC Section 7)
  • Plan adherence and revision (DCC Section 8)

Questions about the research project

  • Research project details (DCC question 1.1.4)
  • Project participants (DCC questions 1.1.5, 1.1.6 and 10.1)
  • Research funding (DCC questions 1.1.2 and 1.1.3)
  • Data management policies (DCC Section 1.3)

Questions about the research data

  • Research area (DCC Section 1.2)
  • The nature of your data (DCC Sections 2.1 to 2.4)
  • Creating data descriptions (metadata, “data about data”) (DCC Sections 2.5 and 3)
  • Data sharing – person to person (DCC Section 4)
  • Short-term data storage and backup (DCC Section 5)
  • Data archiving (DCC Sections 6.1 to 6.2.5)
  • Data publication (DCC Sections 6.2.6 to 6.4)

Questions about related documents (DCC questions 6.3.3 and 6.3.4)

Alignment of the data management planning questions from different sources

I then entered the questions from the other sets into different columns in the Excel spreadsheet, re-ordering them against the DCC questions so that similar questions were aligned across the page in the same row. This revealed which questions were common to several sets, and which were unique.
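
The alignment itself was done by hand in Excel, but the idea can be sketched in Python. In the sketch below the topic labels are hypothetical, the pairing of the two questions under "metadata creation" is an assumption made purely for illustration, and only a handful of the questions quoted in these posts are included.

```python
import csv
import io

SOURCES = ["DCC", "DMP Tool", "Jez Cope"]

# (topic label, source, question). Topic labels are hypothetical and the
# grouping is illustrative; it is not copied from the actual spreadsheet.
questions = [
    ("metadata creation", "DCC",
     "6.3.2 How will this metadata/documentation be created, and by whom?"),
    ("metadata creation", "DMP Tool",
     "What metadata/documentation will be submitted to make the data reusable?"),
    ("data types", "DMP Tool",
     "What data types will you be creating or capturing?"),
    ("data volume", "Jez Cope", "How much data do you generate?"),
    ("data reuse", "DCC",
     "4.3.2 How do you anticipate your new data being reused?"),
]

# One row per topic, one cell per source.
rows = {}
for topic, source, text in questions:
    rows.setdefault(topic, {})[source] = text

# Write the aligned rows as CSV, noting topics covered by only one source.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Topic", *SOURCES, "Unique to"])
for topic, cells in rows.items():
    unique_to = next(iter(cells)) if len(cells) == 1 else ""
    writer.writerow([topic, *(cells.get(s, "") for s in SOURCES), unique_to])

aligned_csv = buffer.getvalue()
print(aligned_csv)
```

Each output row corresponds to one row of the spreadsheet: questions on the same topic sit side by side, and a topic with a single filled cell marks a question that is unique to one set.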

The Excel spreadsheet containing the aligned data management planning questions can be downloaded from here.

A comparison of the question sets was made easier by this alignment, and is described in the following blog post.

[Footnote: Adrian Richardson reported on 23 March 2012 that there are some 1000 DMPs now lodged in the DCC’s database, although some of these, like six of the seven DMPs saved under my own name, are likely to be test submissions to try out the system, rather than genuine DMPs.

For those DMPs designed to accompany grant applications, we need to put this number into context. In the most recent year for which statistics are available, AHRC received 957 grant applications, BBSRC 1,832 applications, EPSRC 2,568, ESRC 905, MRC 1,377, NERC 1,361 and STFC 415, giving a current annual total of 9,415 RCUK grant applications. I was unable to find current figures for the number of applications funded by UK medical charities, but these are likely to add at least 2,000 to the annual total.]
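
The arithmetic in this footnote can be checked with a short Python sketch. The application figures are those quoted above; the comparison at the end is only a rough illustration, since the 1000 lodged DMPs were not all created in one year or for grant applications, and the variable names are assumptions.

```python
# Annual grant applications per research council, as quoted in the footnote.
applications = {
    "AHRC": 957,
    "BBSRC": 1832,
    "EPSRC": 2568,
    "ESRC": 905,
    "MRC": 1377,
    "NERC": 1361,
    "STFC": 415,
}

rcuk_total = sum(applications.values())

# Lower-bound estimate for UK medical charity applications (from the footnote).
charity_estimate = 2000

# Approximate number of DMPs lodged in the DCC's database (from the footnote).
lodged_dmps = 1000

# Rough upper bound on the share of one year's applications that the
# lodged DMPs could represent.
share = lodged_dmps / (rcuk_total + charity_estimate)

print(f"RCUK applications per year: {rcuk_total}")
print(f"Lodged DMPs as a share of annual applications: {share:.1%}")
```

On these figures the lodged DMPs amount to less than a tenth of a single year's grant applications, even before discounting test submissions.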

Posted in Data management planning, JISC

Oxford DMPonline Project update (March 2012)

Since the start of the Oxford DMPonline Project, we’ve been working on a number of things in preparation for the main body of work, which will involve (a) embedding SWORDv2 into the DCC’s DMPonline system, so that its outputs can form a SWORD submission, and (b) customizing the DCC’s DMPonline system for use in the University of Oxford.  That main body of work has been awaiting the release of v3 of the DCC system, which was scheduled for the end of February.

This post is a round-up of our activities and where we are going next.

  1. Richard had an initial meeting at the DCC to discuss the approach we should take to embedding SWORDv2 into their software.  From this, we have an outline design for the integration which will be published soon.
  2. David and Richard attended a JISC programme meeting in London, which brought together our project with the related Data Management Planning for Secure Services Project (DMP-SS), our JISC Programme Manager Simon Hodson, and the DCC’s DMPOnline personnel Adrian Stevenson and Martin Donnelly.  Hosted at the Institute of Child Health by the DMP-SS PI Tito Castillo, this day gave us a good opportunity to discuss the DMPOnline software, to share with the DCC folk some of our concerns and suggestions for version 3.0 of the online tool, and to get feedback from the DCC on our planned approaches.  See Tito’s blog post about this day.
  3. David conducted a user-centric review and Richard a technical review of the DCC’s DMPOnline software in comparison with its US equivalent, the DMP Tool.  We had planned to publish this some time ago, but still need to do a little more work on it.  The comparative evaluation will thus be published shortly.
  4. David attended the Data Management Planning workshop at the 7th International DCC Conference in Bristol, which was also attended by representatives of the US DMP Tool, and was able to have further useful discussions.
  5. David joined euroCRIS and has communicated extensively with members of the euroCRIS CERIF Task Force and the Linked Data Task Force of euroCRIS on the appropriate approaches to linked data and ontologies to represent CERIF, the Common European Research Information Framework that is being strongly supported by the JISC.  This culminated recently in his participation in the CERIF Linked Data workshop in Bath, and his presentation of CERRO, the CERIF Roles and Relationships Ontology, developed with his ontologist colleague Silvio Peroni.  His presentation on that topic is available here.
  6. David has also done some initial work around developing the CERIF-compliant RDF metadata descriptors required to describe and accompany a Data Management Plan.  More thought is required on the best way to proceed with this development, about which we will report in due course.  In the mean time, see his Bath presentation on this topic here.
  7. David met with Oxford colleagues and Jez Cope from the University of Bath to discuss teaching data management skills to graduate students.  This led to his formulation of Twenty Questions for Research Data Management, and an evaluation of them in comparison with other questions on this topic, including those used by the DCC’s DMPOnline tool, in a ‘road test’ by graduate students at the University of Bath, described in Jez’s blog post.

In the immediate future, we will be working on:

  1. Completing and publishing the comparison of the DCC’s DMPOnline tool with the US DMP Tool.
  2. Following Jez Cope’s initial comparison by creating and publishing a side-by-side comparison of the Twenty Questions for Research Data Management with the questions used by the above two tools, and by two other sources, to see what common themes and recommendations emerge.
  3. Deciding how best to create CERIF-compatible metadata for DMPs, and undertaking the necessary ontology development for that.
  4. Initiating the development of the Ruby code required to embed SWORD into the DCC’s DMPOnline environment.  At the moment we are waiting for the DMPOnline 3.0 software to be released as a service and made available as Open Source.  Unfortunately, while waiting for this, there is not much we can do on the core of the project, in terms of embedding in Oxford and developing software extensions.
  5. Designing the full infrastructure required at the University of Oxford to embed the DMPOnline service there.  Integration at Oxford will require customisations to the UI for DMPOnline, as well as integration with single-sign-on and the addition of custom question sets to the DMPOnline instance.  It also requires some integration with the Grants and Research Contracts management software at Oxford, to scope which a meeting with our colleagues in Research Services is being scheduled.
  6. Creating customisations to the DMPOnline software and questions to support the Oxford use cases.   To support (2) above, we will need to create the website skin with an ‘Oxford’ look and feel, and design the questions and supporting help text for the Oxford DMPonline.

We’ll be posting more updates to the blog as they become available.  In the mean time, please do not hesitate to contact us:

David Shotton <david.shotton@zoo.ox.ac.uk> P:01865-271193;

Richard Jones <richard@cottagelabs.com>

Posted in JISC

Twenty Questions for Research Data Management

A Web  entry form that permits creation of a data management plan using these questions is now available at http://www.miidi.org/dmp/.

[Notes: These questions were revised on 22 March 2012 and again on 11 June 2012.   Further changes to improve the clarity of the questions were made on 9 May 2013  – see Footnote.  

This document is also available as a Word file from  http://imageweb.zoo.ox.ac.uk/pub/2012/publications/Shotton-Twenty_Questions_for_Research_Data_Management.docx.]

These twenty questions are designed to prompt and assist your thinking, as a research student, a postdoc or an academic researcher at the beginning of a research project, and to form the basis of a workable research data management plan that can both guide your on-going data management activities and inform others about the nature and availability of your research data.

They will help you determine how best to safeguard your data from loss, how to describe your datasets in ways that assist both yourself when returning to them in the future and others in their subsequent interpretation, and how to publish your data in ways that maximize their usefulness to others and bring maximum academic scholarly credit to yourself, to reward your efforts in acquiring, analysing, describing, interpreting and publishing them in the first place.

You may not have immediate answers to all these questions.  But, by seeking advice from your research supervisor, colleagues and others in your institution with responsibilities for data management, you should endeavour to discover them.  Then, once in a while, you should revisit these questions and see whether your data management practices can be improved, updating your answers.

More detailed data management planning questions are available online, and a comparison of those with these Twenty Questions will be the subject of a subsequent blog post.

The nature of your data

1       What is the general subject discipline (domain, field) to which your research data relates?

Possible responses:

  • Quantum physics.
  • Cell biology.
  • Ornithology.

2       What is the exact nature (range, scope) of your research data?

Possible responses:

  • Long-distance quantum communication using entangled photons.
  • Protein chemistry and electron microscopy of cell membrane proteins.
  • Video field recordings of avian behaviour, and their quantitative analysis.

3       Who will own the data arising from your research, and the intellectual property rights relating to them?

Possible responses:

  • Myself alone.
  • Myself and my research group leader.
  • My university.

4       If you know at this stage, in what format(s) will you store your data in the short term after acquisition?

Possible responses:

  • Questionnaire response data will be stored on my laptop in a Microsoft Office Access 2007 database.
  • Raw video recordings on digital video tapes on the shelf above my desk, and edited videos in .mov format on my laptop.
  • Numerical analyses in a spreadsheet (Microsoft Office Excel 2007 format) on my laptop.
  • On my research group’s cloud-based secure DataStage research data file store, in Zeiss confocal 3D image format.

Data descriptions, so that someone else can understand what the data are about (i.e. metadata, “data about data”)

5       When and where will you describe each of your research datasets?

Possible responses:

  • The only description will be the filenames on my hard drive.
  • I will describe the data using handwritten notes in my lab notebook if and when I have time, after the experiments have been completed – hopefully I’ll be able to remember all the details.
  • I will describe the data using the column and row labels in my spreadsheets after the data have been analysed.
  • I will create descriptive metadata for each dataset as I create/acquire it, and will save these descriptions with my datasets on my hard drive.

6       How will descriptive metadata be created or captured?

Possible responses:

  • Instrument metadata are automatically included in each data file.
  • I will create a title and short textual description for each dataset using the supplied submission interface when I submit the dataset to my university’s data repository.
  • My data descriptions will be saved in spreadsheets or word processor documents.
  • I will create rich metadata conforming to a Minimal Information Standard appropriate to my research field at the time of data acquisition, using a metadata entry form to ensure I don’t miss any essential information.  This metadata file will be saved locally with my dataset, and eventually will be deposited with the dataset when it is submitted to a data repository.

Data sharing and publication

7       With whom will you share your research data in the short term, before publication of any papers arising from their interpretation?

Possible responses:

  • My research supervisor only.
  • Members of my research group and trusted external collaborators.
  • Anyone who asks for them.
  • Everyone, by publishing the data online, since our research community is committed to the rapid sharing of research results.

8      For how long will you embargo your research data before they are published for others to see and use?

Possible responses:

  • We will allow immediate public access to the data.
  • For one year, to permit us to exploit our hard-won research results.
  • Until the journal article describing our results has been published.

9      Why is public access to your research data to be restricted (if indeed it is)?

Possible responses:

  • We intend to make a patent application, and must avoid prior disclosure.
  • Don’t want to make locations of members of endangered species available to poachers.
  • The research data are confidential because of the arrangement my research group has made with the commercial partner sponsoring our research.
  • My data form part of a long-term study upon which my research group is entirely reliant for its on-going research publications and academic reputation.  We only share this with trusted colleagues.
  • Confidential human patient data.
  • Questionnaire data collected in confidence from individuals – anonymized averaged data will be published.

10      Under what data-sharing license will you publish your research data?

Possible responses:

  • What is a data-sharing license?
  • Under a Creative Commons Open Data CC Zero public domain dedication and waiver, since my research data are not covered by copyright.
  • Using a Creative Commons Attribution License, since my image data are covered by copyright.

11      What persistent identifier will be used to permit correct citation of your datasets?

Possible responses:

  • This URL: http://****.
  • A Digital Object Identifier (DOI).
  • The accession number for the dataset issued by the European Bioinformatics Institute database to which the dataset is submitted.

12      What metadata will be published with the data to make them interpretable and reusable?

Possible responses:

  • I will expect users to be able to interpret the column and row labels in my spreadsheets.
  • The dataset will be described in the journal article we will publish, but will have no other metadata beyond those required by the repository for data citation: Author, Date, Title, Source, Identifier.
  • An XML metadata file created in conformance with a Minimal Information standard will be submitted to the repository as part of the data package, along with the data files.

Data storage, backup and archiving

13       Where will you store your data in the short term, after acquisition?

Possible responses:

  • On my laptop.
  • On the computer connected to the microscope.
  • On my research group’s DataStage filestore.

14       Who is responsible for the immediate day-to-day management, storage and backup of the data arising from your research?

Possible responses:

  • Myself alone.
  • My research group’s data manager.
  • Our departmental IT staff, who manage our research group’s DataStage research data management system.

15      How frequently will your research data be backed up for short-term data security?

Possible responses:

  • Whenever I remember to do so.
  • Nightly, using our research group’s DataStage research data management system connected to the University’s automated backup service.

16      Where will your research data be archived for long-term preservation?

Possible responses:

  • Selected data will be included in the figures and tables of research papers published by my research group, but we have no plans to archive and publish the full datasets.
  • As supplementary files attached to my journal articles on the publisher’s web site.
  • In the University’s DataBank data repository, run by the library service.
  • In appropriate genomics databases run by the European Bioinformatics Institute.

17      When will your research data be moved to a secure archive for long-term preservation and publication?

Possible responses:

  • Our research data are already securely stored in an institutional data server.
  • Nightly.
  • Upon completion of each set of experiments.
  • When my research group leader decides it is appropriate.
  • Immediately after publication of my thesis.
  • Upon submission of our Nature paper, so that the data are available for reviewers.

18      Who will decide which of your research data are worth preserving?

Possible responses:

  • Myself alone.
  • Myself, in consultation with my research supervisor.
  • My research supervisor alone.

19      How (i.e. by what physical or electronic method) will you transfer your research datasets to their long-term archive, under the curatorial care of a separate third-party, e.g. a data repository?

Possible responses:

  • On physical hard drives that I will bring back from my field site by air.
  • By e-mailing files to our librarian.
  • By completion of the selected data repository’s Web-based submission form and uploading of the data files over the Internet.
  • By use of a local data management system such as DataStage that can automatically package and submit data files to the selected repository.

20      Who will be responsible for your data, once you have left your present research group?

Possible responses:

  • At this stage, I have no idea.
  • I’ll take my data with me and maintain responsibility.
  • My supervisor will make appropriate arrangements.
  • I hope the journal will maintain access to the supplementary information files associated with my article.
  • My University will assume long-term responsibility for the data I have chosen to preserve in its data archive.

– – – – –

Notes

Creative Commons: Creative Commons is a non-profit organization that has developed a legal and technical infrastructure for the licensing of copyright material and data in a standardised and machine-readable manner, thereby facilitating open publication, sharing and innovation in the digital age.

DataStage and DataBank: DataStage is a simple research data filestore and repository data submission system, designed for deployment at the research group level.  DataBank is a data repository for archiving and publishing research data, designed for deployment at the institutional level.  Both are open-source services for local or cloud deployment developed together at Oxford University within the JISC University Modernization Fund DataFlow Project, and both are now available for third-party installation and use.  

European Bioinformatics Institute:  The European Bioinformatics Institute (EBI) houses Europe’s primary databases for molecular sequence data, genomics and bioinformatics, and shares data daily with similar institutions in the United States and Japan.

Minimal Information Standards for life science research specify minimal metadata requirements for certain types of research data, are integrated by the MIBBI Project (Minimum Information for Biological and Biomedical Investigations), and are described in Reference [1].

Reference

[1]      Taylor et al. (2008). Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology 26 (8): 889-896. doi:10.1038/nbt0808-889.

– – – – –
Footnote:

These questions were revised on 22 March 2012, two weeks after they were first published, to simplify the wording, to remove some redundancy between questions, and to split compound questions into single questions. To keep the total number of questions to 20, two questions about when data would be collected and analysed have been removed.  The remaining twenty questions have been slightly re-ordered.

Following suggestions by Sally Rumsey of the Bodleian Library, minor revisions were then made on 11 June to the text of questions 5, 6 and 18, and to the possible responses for questions 14 and 18, in order to add clarity and remove ambiguities.  Question 20 was also moved to position 15, and the subsequent questions re-numbered (so that Question 18 is now Question 19, etc.).  The Notes were also edited to update the information on DataFlow and to delete the description of SWORDv2, considered to be too specialized.

On 9th May 2013, some questions were slightly changed in wording, and others swapped in position and renumbered to make the flow of questions more logical, to match changes to the online data entry form at http://www.miidi.org/dmp/.  Question 3 was swapped with question 4, and questions 8-12 were swapped with questions 13-20.  Some of the exemplar responses were also revised to make them more useful.

A list of the original questions follows.

Original Twenty Questions published on 7 March 2012

1        What is the subject discipline (domain, field) to which your research data relates?

2        What is the exact nature (range, scope) of your research data?

3        When will your research data be collected?

4        When will your research data be processed and analysed?

5        Who owns the data arising from your research, and the intellectual property rights relating to them?

6        How will your research datasets be described, i.e. with what metadata or accompanying interpretive information will they be accompanied, and how will these metadata be created?

7        Where, and in what format(s), will you store your data in the short term after acquisition?

8        Who is responsible for the immediate day-to-day management, storage and backup of the data arising from your research?

9        How frequently and where will your research data be backed up for short-term data security?

10       With whom will you share your research data in the short term, before publication of any papers arising from their interpretation?

11       Why is access to your research data to be restricted in the short term (if indeed it is)?

12       To whom will you provide access to your research data in the long term, with what limitations as to re-use, and under what license arrangements?

13       Why is access to your research data to be restricted in the long term (if indeed it is)?

14       How (i.e. by what physical or electronic method) are your research datasets to be transferred from short-term storage under the local care of yourself or your research group to their long-term archival and Web publication destination under the curatorial care of a separate third-party, e.g. a data repository?

15       Where will your research data be archived for long-term preservation?

16       When will your research data be moved from your own local storage to a secure archive for long-term preservation (e.g. your institutional library’s data repository)?

17       Who has authority to decide which of your research data are NOT worth preserving and will be deleted?

18       Where will your research data be published for others to see?

19       When will your research data be published in this manner?

20       To whom will responsibility for the long-term preservation of your research data devolve, once you have left your present research group?

This document is licensed under a Creative Commons Attribution 3.0 Unported License.

Posted in JISC

How should a data management plan be structured?

We recently held a small meeting to discuss the teaching of good data management practice to graduate students at the University of Oxford, which at present happens only to a very limited extent.  Allowing for our shared roles, those attending included researchers at all levels (a graduate student, a research fellow, two research group leaders), three teachers (from an academic department and the University’s Computing Services IT Learning Programme), and two professional project/data managers.  We were all members of Oxford University with the exception of Jez Cope (ICT Project Manager, Centre for Sustainable Chemical Technologies, Department of Chemistry, University of Bath), who had been invited to attend after expressing interest in this area to me during a recent JISC Managing Research Data conference.  Some important ideas emerged about data management planning and data management plans (DMPs), particularly as they relate to graduate students, that I will try to encapsulate here.

First, appeals to altruism – the virtues of data publication for the greater good of science – are unlikely to succeed.  Data management training should first and foremost emphasise the benefits that good data management can bring to the students themselves, for example in terms of being better organized, working more efficiently, and being more able to assemble the right data quickly for inclusion in figures and tables during later thesis and article writing.  Such training should also emphasize the potential dangers of data loss if students do NOT make plans to manage – and particularly to back up – their data, employing salutary horror stories, such as the one described in the previous blog post, and photos of burned out computers after a laboratory fire, to illustrate the point.

While this discussion was undertaken in the context of graduate students, it was agreed that the following conclusions applied in equal measure to all researchers.  In order of decreasing importance, the issues of relevance surrounding data management were seen to be:

  1. Benefits of data management and data backup – “What’s in it for me?”
  2. Determining the intrinsic value of data – “How do I decide what data to keep and what to discard?”
  3. Issues of data confidentiality and data theft – “How do I avoid being scooped?”
  4. Issues of data publication and data citation – “How can I get personal credit for spending time on data management and for data publication?”
  5. Issues of data ownership – “Do I own my data, or does my supervisor or the University?”
  6. Administrative issues, such as compliance with institutional and funders’ requirements – “What is the minimum I need to do to get funded?”

While these might not be considered the ideal questions for researchers to be posing, from the point of view of those of us interested in open data publication, it was agreed that this was the reality of the situation on the ground (or rather, at the lab bench).

For this reason, it was felt that current data management planning tools had the wrong emphasis, being constructed from the ‘top down’ viewpoint of an institution or a data manager, in a way that was likely to be off-putting to the person completing the plan, rather than stressing initially what was of central importance to the researchers themselves, thereby increasing relevance and gaining their enthusiastic engagement.

For example, the DCC’s DMPonline tool, designed to create DMPs that will accompany grant applications, starts by asking the researcher to input information about funders’ requirements, institutional guidelines and other policies, and then proceeds through questions about data types, intellectual property rights, and altruistic data sharing and re-use.  Only after 42 other questions (in the DCC’s generic DMP) is the researcher finally asked:

  • Where (physically) will you store your data?
  • How will you back-up your data?
  • How regularly will back-ups be made?

It was thus thought that, in order to achieve widespread uptake of proactive data management planning across the university, the DMP questions that we would use need to be designed, on the basis of the preceding points, to address more clearly those issues of primary concern to researchers, not only from the point of view of the Principal Investigator, but also that of the lowly graduate student.

Jez Cope and I agreed to have a crack at this, and we welcome input from others with similar concerns.  Feedback, please, to j.cope@bath.ac.uk and to david.shotton@zoo.ox.ac.uk.  Watch this space!

Posted in JISC

Why YOU need a data management plan

I recently learned of the following sad tale, reported by Peter Murray-Rust last August on his blog at http://blogs.ch.cam.ac.uk/pmr/2011/08/01/why-you-need-a-data-management-plan/.  I re-post just the image here, as a salutary story, and encourage people to read the original.

Peter wrote: “The following appeared on noticeboards in the Chemistry Department – the Panton Arms is just 200 metres away.”

Apparently there was no data backup – all this poor student’s work was contained in the bag!

Please inform me <david.shotton@zoo.ox.ac.uk> if you know of similar unfortunate experiences.

Posted in JISC