DMP questions – comparisons and conclusions

This blog post compares sets of data management planning questions, following their alignment in a spreadsheet whose creation is described in the previous blog post, and draws general conclusions about such question sets.

Characteristics of each set of questions

The following data management planning question sets are aligned in the comparison spreadsheet, which is downloadable from here.

DCC’s DMP Online questions (Checklist version 3.0, dated 2 Feb 2012):

The DCC uses a large set of questions for its DMPonline system, 94 questions in all, which have been carefully ordered in a number of distinct sections. Great care has been taken to eliminate duplication of questions soliciting information about the same thing, but there are still a few compound questions that demand multiple answers. If the DMP being created is to accompany a grant application, the questions asked of the user are tailored to the funding agency’s requirements. Once a project has been funded, the user is given access to the full set of questions to enable more detailed planning.

Areas that are covered in detail by the DCC’s DMPonline questions, but that have scant representation in the other question sets, include:

    • Resourcing
    • Plan adherence
    • Policy
    • Funders and budget
    • Use of existing data
    • Ethical issues
    • Likely reuse
    • Security

Many of the DCC’s questions ask why people are doing certain things, requiring discursive answers.

David Shotton’s Twenty Questions for Research Data Management:

This question set is designed to be concise, and deliberately omits metadata questions (e.g. “Who is the author of this DMP?”). Most of the questions map directly onto the DCC’s DMP questions, but there are three unique questions relating to methods and to who decides about data triage. To guide the user, each question is accompanied by exemplar short possible responses.

DMP Tool (USA):

Rather than giving a prescriptive list of questions, the DMP Tool developed in the USA takes a less structured approach, presenting the user with a small number of sections to complete. Each section has a text box in which the user can enter free text on that particular topic, with guidance text and suggested questions to consider provided in the accompanying Help Box. Most of the questions have been cherry-picked from the DCC’s question set, as part of a collaboration between the two development projects, with a few new ones added. The exact number of sections and questions suggested varies according to which funding agency has been selected.

The following text, taken from the National Science Foundation Generic (NSF-GEN) option, is an example of the wording provided in a Help Box, this one for Section 1: Types of data produced:

Give a short description of the data, including amount (if known) and content. If the project will be collecting data of a sensitive nature, note here and reflect upon it in subsequent sections. Data types could include text, spreadsheets, images, 3D models, software, audio files, video files, reports, surveys, patient records, etc. Consider these questions:

      • What data will be generated in the research?
      • What data types will you be creating or capturing?
      • How will you capture or create the data?
      • If you will be using existing data, state that fact and include where you got it.
      • What is the relationship between the data you are collecting and the existing data?

For simplicity, it was only the specific bulleted questions from the DMP Tool that I entered into my comparison spreadsheet. As a consequence, the full extent of the DMP Tool’s requested information is under-represented. For example, the DMP Tool column of the spreadsheet does not contain the specific question “What amount of data will be generated?”, although that question is implied in the statement given above: “Give a short description of the data, including amount (if known) and content.”

DataTrain Questions for Post-Graduate Research Projects:

This question set is designed specifically for use by graduate students commencing their research projects. As such, it contains no questions about funders or resourcing. It is the smallest of the sets being compared, containing just six metadata questions and six data questions (one of which is actually about the student’s thesis). It contains no questions about short-term data storage and backup, nor about data publication, and as such is insufficient for creating an effective DMP.

Jez Cope’s Questions for Post-Graduate DMPs (University of Bath):

This question set is also designed specifically for use by graduate students commencing their research projects, but is more comprehensive than the DataTrain set, containing seven basic metadata questions and (coincidentally) twenty questions about the data. Uniquely, it contains five questions that no-one else had thought to ask, about actions to be undertaken as a result of creating the DMP, about the frequency of data acquisition and the volume of data, and about versioning of data files.

Comparison of the question sets

After aligning the different sets of data management planning questions in the spreadsheet as described in the previous blog post, I was able to compare them more easily.

First, I looked down each set of questions and marked

      • in green those questions that were compound, demanding two or more answers;
      • in plum those duplicate questions that asked for the same or similar information in different ways; and
      • in blue questions that were unique to each of the smaller sets, using the DCC questions as the primary comparator.

The numbers of such questions in each set are shown in the following table:

Source              Plan       Project    Data       Compound   Duplicate  Unique
                    questions  questions  questions  questions  questions  questions

DCC DMPonline       17         14         63         3          0          (Reference)
Shotton (Oxford)     0          0         20         0          0          3
DMP Tool (NSF)      12          0         30         5          5          3
DMP Tool (NIH)       9          0          8         1          2          2
DataTrain            4          2          6         1          0          1
Jez Cope (Bath)      5          2         20         1          2          5
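
The tallies above were made by hand while colour-marking the spreadsheet. As an illustration only (the original analysis used no scripts), a minimal Python sketch of how such counts could be automated, assuming each question has been paired with a one-letter category code (‘c’ for compound, ‘d’ for duplicate, ‘u’ for unique, empty for ordinary):

```python
from collections import Counter

def tally(marked_questions):
    """Count compound ('c'), duplicate ('d') and unique ('u') questions
    in a list of (question_text, category_code) pairs."""
    counts = Counter(code for _, code in marked_questions if code)
    return {"compound": counts["c"],
            "duplicate": counts["d"],
            "unique": counts["u"]}

# Hypothetical marked excerpt from one question set
marked = [
    ("What data will be generated in the research?", ""),
    ("How will this documentation be created, and by whom?", "c"),
    ("What is the version number of this plan?", "u"),
    ("When was this plan last amended?", "u"),
]

print(tally(marked))  # {'compound': 1, 'duplicate': 0, 'unique': 2}
```

The question texts and category codes here are invented for the example; the real marking was a manual judgement, question by question, in the spreadsheet.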

Compound and duplicate questions

One thing that quickly became clear while I was creating my own Twenty Questions for Research Data Management was how easy it was inadvertently to ask compound questions requiring two or more answers, or to ask for the same information twice in slightly different ways. By revising my own questions, splitting compound questions into pairs of single questions, and rewording and combining others, I managed to eliminate these problems from the published Twenty Questions, but similar problems are present in the other question sets.

Examples of compound questions:

      • DCC question 6.3.2: How will this metadata/documentation be created, and by whom?
      • DMP Tool: Who will hold the intellectual property rights to the data and how might this affect data access?
      • Jez Cope: What should/shouldn’t be shared and why?

Examples of duplicate questions in the DMP Tool (NSF):

      • What contextual details (metadata) are needed to make the data you capture or collect meaningful?
      • What metadata/documentation will be submitted to make the data reusable?
      • What related information will be deposited?

Examples of duplicate questions in Jez Cope’s set:

      • Who else should reasonably have access to your data?
      • Who should have access and under what conditions?
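
Near-duplicates like these share much of their wording, so a crude word-overlap check can flag candidate pairs for manual review. The sketch below is mine, not part of the original analysis, and the 0.3 threshold is an arbitrary choice for illustration:

```python
import re

def jaccard(a, b):
    """Word-level Jaccard similarity between two questions."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb)

def flag_duplicates(questions, threshold=0.3):
    """Return pairs of questions whose word overlap meets the threshold."""
    return [(q1, q2)
            for i, q1 in enumerate(questions)
            for q2 in questions[i + 1:]
            if jaccard(q1, q2) >= threshold]

questions = [
    "Who else should reasonably have access to your data?",
    "Who should have access and under what conditions?",
    "How often do you get new data?",
]
print(flag_duplicates(questions))  # flags only the first pair as possible duplicates
```

Such a check would only catch duplicates that reuse the same words; questions asking for the same information in entirely different phrasing would still need a human reader.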

Institution-specific questions

There were two institution-specific questions that were impossible to understand outside the specific context in which they were asked.

Institution-specific questions in the DMP Tool set:

Solicitation number

Institution-specific questions in the DataTrain set:

3 Has a ‘File Structure/Naming Form’ been completed?

Unique questions

Surprisingly, given the number and variety of questions comprising the DCC’s question set, each of the other smaller question sets contained metadata questions or data questions that were unique. Jez Cope’s question set was remarkable in containing five unique questions – two unique metadata questions out of a total of seven, and three unique data questions out of a total of twenty – questions that no-one else had thought to include in their sets.

All the unique questions are listed below:

Unique data questions from Shotton’s Twenty Questions:

(note – question numbers changed to match updated 20 Questions)

5 When and where will you describe each of your research datasets?

Possible responses:

  • The only description will be the filenames on my hard drive.
  • I will describe the data using handwritten notes in my lab notebook if and when I have time, after the experiments have been completed – hopefully I’ll be able to remember all the details.
  • I will describe the data using the column and row labels in my spreadsheets after the data have been analysed.
  • I will create descriptive metadata for each dataset as I create/acquire it, and will save these descriptions with my datasets on my hard drive.

18 Who will decide which of your research data are worth preserving?

Possible responses:

  • Myself alone.
  • Myself, in consultation with my research supervisor.
  • My research supervisor alone.

19 How (i.e. by what physical or electronic method) will you transfer your research datasets to their long-term archive, under the curatorial care of a separate third-party, e.g. a data repository?

Possible responses:

  • On physical hard drives that I will bring back from my field site by air.
  • By e-mailing files to our librarian.
  • By completion of the Web-based database submission form and uploading of the data files over the Internet.
  • By automated data packaging and repository submission over the Web from my local DataStage filestore, using the SWORD repository submission protocol.

Unique metadata questions from DMP Tool (NSF-BIO and NIH):

Plan name

Comment

Unique data questions from DMP Tool (NSF-BIO):

What data types will you be creating or capturing?

What procedures does your intended long-term data storage facility have in place for preservation and backup?

Unique metadata questions from DataTrain:

Version (number)

Date amended

Unique metadata questions from Jez Cope:

What actions have you identified from the rest of this plan?

What further information do you need to carry out these actions?

Unique data questions from Jez Cope:

How often do you get new data?

How much data do you generate?

What different versions of each data file do you create?

Because of its size, the DCC’s question set contained many questions that are not present in the other sets, or that are on topics represented by fewer questions. (But see remarks above about the DMP Tool questions.) Of these unique DCC questions, the most significant are listed below. Note how some of these relate to policy issues or administrative issues about which a typical research scientist is likely to have little knowledge, which would thus be difficult and time-consuming to answer satisfactorily:

Unique metadata questions from DCC’s DMPonline:

1.4.2 Aims and purpose of this plan

1.4.3 Target audience for this plan

1.3.1 Funding body requirements relating to the creation of a data management plan

1.3.2 Institutional or research group guidelines

1.3.3 Other policy-related dependencies

7.2 How will data management activities be funded during the project’s lifetime?

7.3 How will longer-term data management activities be funded after the project ends?

8.2.2 Who will carry out reviews?

Unique data questions from DCC’s DMPonline:

2.2.1 Have you reviewed existing data, in your own institution and from third parties, to confirm that new data creation is necessary?

2.2.3 Describe any access issues pertaining to the pertinent existing data

2.3.1 Why do you need to capture/create new data?

2.3.4 What criteria will you use for Quality Assurance/Management?

2.4.2 How will you manage integration between the data being gathered in the project and pre-existing data sources?

2.4.3 What added value will the new data provide to existing datasets?

3.1.3 Is the data that you will be capturing/creating “personal data” in terms of the Data Protection Act (1998) or equivalent legislation if outside the UK?

3.1.4 What action will you take to comply with your obligation under the Data Protection Act (1998) or equivalent legislation if outside the UK?

3.2.1 Will the dataset(s) be covered by copyright or the Database Right?

3.2.4 For multi-partner projects, what is the dispute resolution process / mechanism for mediation?

4.2.1 Does the original data collector/ creator/ principal investigator retain the right to use the data before opening it up to wider use?

4.3.1 Which groups or organisations are likely to be interested in the data that you will create/capture?

4.3.2 How do you anticipate your new data being reused?

5.3.1 How will you manage access restrictions and data security during the project’s lifetime?

5.3.2 How will you implement permissions, restrictions and/or embargoes?

5.3.3 Give details of any other security issues.

6.2.1 Will or should data be kept beyond the life of the project?

6.2.5 On what basis will data be selected for long-term preservation?

6.2.7 Will transformations be necessary to prepare data for preservation and/or data sharing?

6.3.5 How will you address the issue of persistent citation?

6.3.3 Will you include links to published materials and/or outcomes?

Missing data management planning questions

Some important questions were not asked by anyone!

Missing metadata questions:

Who is the Principal Investigator of the research project to which this DMP relates?

What is the Principal Investigator’s department?

Does the research described in this plan require approval by an ethics committee?

Is this DMP confidential, or will it be made public after a positive funding decision on the grant application for the project to which it relates?

What is the unique identifier for this DMP?

Missing data questions:

Do/will the datasets described in this DMP contain experimental or observational data that it would be impossible to re-acquire or re-collect (e.g. sub-atomic particle disintegration data, seismic records, animal behaviour observations, astronomical or meteorological data, questionnaire responses)?

Are metadata created automatically for any of your data (e.g. time, date and geo-location information captured in the Exif header of an image file)?

If so, please define the data types and specify the nature of the accompanying metadata.

To what published journal article(s) do the data covered by this DMP relate?

Conclusions

      1. Creating an appropriate question set for DMPs is difficult work, since there are many possible questions one could ask about data management.
      2. Care needs to be taken to avoid asking ambiguous questions and questions that require more than one answer, and to avoid asking for the same or similar information multiple times in different questions.
      3. Despite the comprehensiveness of the DCC’s DMPonline question set, each of the other question sets has unique questions not covered by the DMPonline set.
      4. All of the available question sets have drawbacks, and some have unique strengths.
      5. In terms of comprehensiveness, the best may be the enemy of the good enough.
      6. Further work needs to be done by the community as a whole to build on the work undertaken so far, and to devise and standardize the best possible set of questions for different constituencies of user, e.g.
        1. Applicants for research grants, who need concise DMPs tailored to funders’ requirements, including the possibility of including standardized institutional answers to certain questions (for example about data backup facilities and the institutional data repository).
        2. Students and researchers starting research projects on existing funding, who have no need for questions about funders and budgets, but who need to be encouraged to think about the nitty-gritty of their own data management tasks.
      7. There will always be a need for individuals or institutions to be able to add their own specific questions to a question set, questions that are particular to their situation and could not have been anticipated by those devising the question set.

This last point relates to the functionality of the software tools that permit answers to data management planning questions to be collected and a formal data management plan to be created. I will compare these tools in a subsequent post.

This entry was posted in Data management planning, JISC.

Responses to DMP questions – comparisons and conclusions

  1. Really useful analysis, David.

    Martin and I are meeting tomorrow to discuss how to overhaul the DCC Checklist. We hope to make it much shorter, simpler and more closely matched to funder requirements.

    We hope to get feedback from you and others in the community so will circulate drafts for comments.

    Sarah

  2. Is some of this variety due to the range of information required by different funders? For example, the AHRC Technical Annex is structured in a very different way to the DMPs that most funders are interested in. I know the DCC tried to collect all the questions and winnow them down from there…
