July 24, 2024
Here, Julien Colomb presents, in a very short post, the idea of having an attribution reference and a provenance reference for a hardware, software, or dataset object. He presents two objectives of citing data, one workaround for having it all, and what this would mean for hardware papers, and he ends with a question about the validity of these ideas. He hopes that it may open a discussion on this topic.
The two objectives of citing data
People have been publishing datasets for decades now, and there is ongoing research on the apparent trade-off between providing persistent identifiers for the data and allowing data versions. During a workshop of data publishing professionals (“Forschungsdaten-Publikationen zwischen Dynamik und Persistenz”, 28.06.2024 in Berlin, Klump et al. (n.d.)), there were discussions about how to deal with the granularity of the data in different projects and with its versioning (should modifications of and additions to the dataset be treated differently, how should aggregates of datasets be handled, and so on). In the discussion, an important aspect of data citation was mentioned: there are two different purposes for citing data. On the one hand, data is cited for transparency and reproducibility. Authors cite the exact version of the data they used so that their analysis can be reproduced and the context of their study can be understood. On the other hand, data is cited to give attribution to the team who provided the data. In this second case, it would be most efficient if different authors cited the same object, even if the data changed a bit. Indeed, in the current academic metric system, having one research output cited 50 times is valued much more than having 50 outputs cited once each.
This is reminiscent of the Citation File Format (Druskat et al., n.d.) design used in software publication. As software is developed further, different versions get different DOIs, but the software authors can set one “preferred-citation” field so that the same element is cited for every version of the software. Authors using the software can then mention the DOI of the version they used as well as another element that recognizes the work of all contributors to the software. This second element could be a scientific paper or a “Concept DOI” on Zenodo (European Organization For Nuclear Research and OpenAIRE, n.d.) that relates to all versions and points to the latest release. Similarly, R packages sometimes have multiple elements to be cited. For instance, this website was created using Blogdown, which provides citations for the package, Xie, Dervieux, and Presmanes Hill (2024), and for its guide, Xie, Hill, and Thomas (2017).
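The relationship between one stable citation target and many version DOIs can be sketched as a small data model. This is only an illustration: the class names, the project, and the DOIs are hypothetical, and `preferred_citation` merely plays the role of the `preferred-citation` field of the Citation File Format.

```python
from dataclasses import dataclass, field


@dataclass
class Release:
    """One published version of a software project (provenance)."""
    version: str
    doi: str  # version-specific DOI


@dataclass
class Software:
    """A software project with one stable citation target (attribution)."""
    title: str
    preferred_citation: str  # same idea as the CFF `preferred-citation` field
    releases: list[Release] = field(default_factory=list)

    def references_for(self, version: str) -> tuple[str, str]:
        """Return (attribution reference, provenance DOI) for a version."""
        for release in self.releases:
            if release.version == version:
                return self.preferred_citation, release.doi
        raise KeyError(f"unknown version: {version}")


# Hypothetical project; the DOIs below are placeholders, not real records.
tool = Software(
    title="example-tool",
    preferred_citation="Doe et al. (2024), example-tool paper",
    releases=[
        Release("1.0", "10.1234/example.1"),
        Release("1.1", "10.1234/example.2"),
    ],
)

attribution_a, provenance_a = tool.references_for("1.0")
attribution_b, provenance_b = tool.references_for("1.1")
```

Whatever version a user ran, the attribution part stays the same and only the provenance DOI changes, which is the property that lets citations accumulate on a single object.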
Working around
While in principle one should then cite two objects for one dataset or software, this may be problematic when the number of references is limited, and it may confuse readers. To a human reader, the reference to the particular version is only useful in the text, not in the reference list (the situation might be different when scraping paper metadata).
One solution would be to give the version in the text only, for instance citing this blog post as (Colomb et al. (2024), v: unpublished version).
Another solution would be to treat software like a book (the concept software, with a whole list of contributors) with chapters (each version being similar to a chapter). For instance, a reference to the 2022 blog post entries on this site could be (Colomb et al. (2024), v: 0.9 10.5281/zenodo.10592333), while a reference to more recent ones would be (Colomb et al. (2024), v: 0.91 10.5281/zenodo.12805477), and a reference to this one would at the moment be (Colomb et al. (2024), v: unpublished version). This is what I call here an attribution reference (analogous to the book) and a provenance reference (analogous to the chapter).
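The citation pattern used in these examples can be written as a small formatting helper. This is a minimal sketch: the function name `format_reference` and its parameters are made up for illustration, and the output format simply copies the one used in this post.

```python
from typing import Optional


def format_reference(attribution: str, version: str, doi: Optional[str] = None) -> str:
    """Combine an attribution reference with a provenance reference."""
    provenance = f"v: {version}"
    if doi:
        # The version-specific DOI is only added when one exists.
        provenance += f" {doi}"
    return f"({attribution}, {provenance})"


ATTRIBUTION = "Colomb et al. (2024)"

ref_2022 = format_reference(ATTRIBUTION, "0.9", "10.5281/zenodo.10592333")
ref_recent = format_reference(ATTRIBUTION, "0.91", "10.5281/zenodo.12805477")
ref_current = format_reference(ATTRIBUTION, "unpublished version")

print(ref_2022)
print(ref_recent)
print(ref_current)
```

The attribution part is a constant shared by all three references, while the provenance part carries the version and, when available, its DOI.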
This was created manually here, but one could think of a way to get that type of citation when entering the version-specific DOI in a reference manager. Would there be a place in the DataCite and Crossref metadata schemas for this (something similar to the preferred-citation field in the Citation File Format)?
What about data/software/hardware papers?
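One possible starting point, offered as a sketch rather than an answer to the question: the DataCite schema lets a record list related identifiers with a relation type such as `IsVersionOf`. Assuming a repository fills this in for each version, a reference manager could look up the object that a version belongs to. The record below is hand-written, its DOIs are placeholders, and the dictionary layout is an assumption modelled on DataCite-style JSON.

```python
from typing import Optional


def find_concept_doi(metadata: dict) -> Optional[str]:
    """Return the DOI this record is a version of, if the metadata says so."""
    for related in metadata.get("relatedIdentifiers", []):
        if (
            related.get("relationType") == "IsVersionOf"
            and related.get("relatedIdentifierType") == "DOI"
        ):
            return related.get("relatedIdentifier")
    return None


# Hand-written example record; both DOIs are placeholders.
version_record = {
    "doi": "10.1234/example.2",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.1234/example.0",
            "relatedIdentifierType": "DOI",
            "relationType": "IsVersionOf",
        }
    ],
}

concept_doi = find_concept_doi(version_record)
```

This only recovers a link between a version and its parent object. It does not cover the case where the attribution reference is a separate paper, which is where a dedicated preferred-citation field would still be needed.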
As stated above, the attribution reference could point to an academic paper. The relative success of the Journal of Open Source Software (2574 papers published to date) shows that the format is attractive to researchers, probably because it is a known format in the academic world.
If the paper is then mostly used to give attribution, one might emphasize this aspect in the paper itself. This would mean that this type of paper should be dedicated to providing information about author contributions, funding, and citations of other projects that were important in the development of the object. This calls for a better definition of authorship and contribution roles for data, software, and hardware, and for making them machine-readable. Initiatives like the Research Software Authorship and Contribution Task Force are important steps towards better recognition of that work.
Am I off?
I am looking forward to getting feedback on this idea and will use this blog post as an entry point for such discussions. I am sure I am not the first one to think of this, and you may have more experience to share.
You can find ways to contact me on my ORCID page (https://orcid.org/0000-0002-3127-5520), reach me on Mastodon at @jcolomb@nerdculture.de, or see and add comments via https://nerdculture.de/@jcolomb/112852147363418957.
# References