Exploring Re-use of Datasets in Institutional Data Repositories


As a general rule, academics should cite their sources any time they use someone else's words, methods, data or ideas in their own research. With data and code, re-use takes many forms: incorporating the data directly into your own raw data, running someone else's analysis code on your own data, or even re-visualising the outputs. You can of course re-use your own data, in the same way that you cite your own papers as evidence of previous findings.

Pyle, David; Parks, Michelle; Mather, Tamsin; Nomikou, Paraskevi (2014): 2012 Santorini LiDAR data taken from Figshare.

We know that a paper’s citation rate usually peaks 2 years after publication. When a dataset or other research output is published to support a paper, it immediately gains 1 link when the journal article is published. As a result, a great many datasets have exactly 1 citation. In fact, until recently, every Dryad dataset had to be associated with a publication, so every dataset had at least 1 citation.

It should therefore be recognised that real reuse of data is only apparent when a dataset has >1 citation, ideally from papers with different authors.
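The rule of thumb above can be written down as a small check. This is only a sketch under stated assumptions: the function name `reuse_summary` and the author names are hypothetical, and "different authors" is read here as "no author shared with the dataset's creators", which is one possible interpretation rather than how Figshare counts it.

```python
def reuse_summary(dataset_authors, citing_papers):
    """Summarise citations to one dataset.

    dataset_authors: names of the people who published the dataset.
    citing_papers: one list of author names per citing paper.
    Assumption: a paper counts as independent when it shares no author
    with the dataset (names are compared case-insensitively).
    """
    creators = {name.strip().lower() for name in dataset_authors}
    independent = [
        paper
        for paper in citing_papers
        if creators.isdisjoint(name.strip().lower() for name in paper)
    ]
    total = len(citing_papers)
    return {
        "citations": total,
        "independent_citations": len(independent),
        # More than 1 citation: something beyond the supporting paper.
        "shows_reuse": total > 1,
        # Stronger signal: at least one citing paper by other authors.
        "shows_independent_reuse": total > 1 and len(independent) > 0,
    }


if __name__ == "__main__":
    # Hypothetical names, purely for illustration.
    print(reuse_summary(
        ["Author A", "Author B"],
        [["Author A", "Author C"], ["Author D"]],
    ))
```

Matching on raw name strings is fragile (initials, name changes), so a real analysis would more likely compare persistent author identifiers.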

Figshare's infrastructure uniquely tracks citation counts for all of the outputs in an institution’s repository. We do this, through a partnership with our sister company Dimensions, by looking for citations to DOIs in the full text of articles and not just in the reference list. Whilst it is still too early to find conclusive evidence of what drives re-use of data, we can investigate a handful of examples published on our institutional clients' infrastructure.
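To give a feel for what full-text matching involves, here is a minimal sketch. It is not the Figshare or Dimensions implementation: the regular expression is a common heuristic for DOI-like strings, and the names `DOI_PATTERN`, `find_dois` and `cites` are made up for this example.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix
# that runs until whitespace, a quote or an angle bracket.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def find_dois(full_text):
    """Return the set of DOI-like strings found anywhere in the text."""
    found = set()
    for match in DOI_PATTERN.findall(full_text):
        # Drop sentence punctuation that often trails a DOI in prose,
        # and lower-case because DOIs are case-insensitive.
        found.add(match.rstrip(".,;:)]}'").lower())
    return found


def cites(full_text, doi):
    """True if the given DOI appears anywhere in the article text."""
    return doi.lower() in find_dois(full_text)


if __name__ == "__main__":
    # Invented sentence; the DOI is one of the examples in this post.
    sample = ("The recordings are openly available "
              "(https://doi.org/10.25378/janelia.6163622.v6).")
    print(cites(sample, "10.25378/janelia.6163622.v6"))
```

The sketch matches DOIs exactly, so a citation to a different version of the same item would not be counted; how versions are grouped is a design decision this example leaves open.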

Institution | Output | Type | Citations from papers | Papers with distinct authors
Stockholm University | https://doi.org/10.25378/janelia.6163622.v6 | Dataset | 6 | 3
Loughborough University | https://doi.org/10.17028/rd.lboro.6176450.v1 | Dataset | 4 | 2
Royal Holloway - University of London | https://doi.org/10.17637/rh.7000520.v4 | Code - Listed as Dataset | 3 | 3
University of Adelaide | https://doi.org/10.25909/5becfa45c176f | Dataset | 3 | 3
HHMI - Janelia | https://doi.org/10.25378/janelia.6163622.v6 | Dataset | 3 | 3

In the above examples, we see a good mix of citations by the same authors, suggesting re-use in the same way that papers are self-cited. However, each one of them has also been cited by another research group. We want to start peeling back the layers to understand why these research groups have re-used and cited the data or code.

Is compliance with the FAIR data principles a prerequisite for data re-use?

Four of the five examples that were randomly sampled above would seem to comply with the concept of being Findable, Accessible, Interoperable and Re-usable by both humans and machines. The one with the lightest metadata has only the following description: “The data comprises phenotypic measurements from two field trials in South Australia and genotyping information for more than 500 diverse wheat accessions.” There is no README file, and yet it has been re-used and cited by another research group.

Some datasets are better set up for re-use, with authors at the Janelia Research Campus even suggesting ways to reuse the data:

“Some potential projects to do with this data:

1) peer prediction: how well can you predict a neuron from the other 10,000? Can you beat our score?

2) face prediction: how well can you predict a neuron from the behavioral patterns on the face videos?

3) manifold discovery: can you find a nonlinear low-dimensional embedding? how low can it go?”

Interestingly, when we examine the citations, we see the following: “These findings agree well with those of the Stringer et al” and “We test our method on a publicly available calcium imaging data set.”

Working out why people cite and re-use non-traditional content in institutional repositories takes a lot of investigation, so we are just scratching the surface of understanding what drives researchers to re-use data, let alone how we can encourage this behaviour. The fact that it is happening at all is a large societal change for a lot of academic research areas, and that alone is massively encouraging to see. As the numbers for data citation grow, we will continue to try to understand what encourages these steps. If you’d like to play with the citation data yourself, please get in touch at info@figshare.com

If you would like to track more of the research outputs coming out of your institution, get in touch to learn how you can get your own ‘out of the box’ data repository set up, on your own domain and with your own DOIs. Check out https://knowledge.figshare.com/institutions for more details.

Jun 8, 2020 14:49

This is a reposted article from our blog; the original article can be found at:

